MARCUS HØY HANSEN

198
University of Southern Denmark PhD Thesis MARCUS HØY HANSEN DEVELOPMENT OF CLINICALLY APPLICABLE NEXT-GENERATION BIOINFORMATICS Driven by the Genomic Heterogeneity of Mantle Cell Lymphoma for Evidence-based Diagnostics, Disease Tracking, and Follow-up Department of Clinical Research September 2021

Transcript of MARCUS HØY HANSEN

University of Southern DenmarkPh

D T

hesi

s

MARCUS HØY HANSEN

DEVELOPMENT OF CLINICALLY

APPLICABLE NEXT-GENERATION

BIOINFORMATICS

Driven by the Genomic Heterogeneity of Mantle Cell Lymphoma for Evidence-based Diagnostics, Disease Tracking, and Follow-up

Department of Clinical Research

September 2021

DEVELOPMENT OF

CLINICALLY APPLICABLE

NEXT-GENERATION

BIOINFORMATICS

Driven by the Genomic Heterogeneity of Mantle Cell

Lymphoma for Evidence-based Diagnostics, Disease

Tracking, and Follow-up

PhD Dissertation

Marcus Høy Hansen

Department of Clinical ResearchUniversity of Southern Denmark

September 2021

DEVELOPMENT OF CLINICALLY

APPLICABLE NEXT-GENERATION

BIOINFORMATICS

Driven by the Genomic Heterogeneity of Mantle Cell Lymphoma forEvidence-based Diagnostics, Disease Tracking, and Follow-up

PhD Dissertation

Marcus Høy HansenMSc. biomedical engineering (cand.scient.med.)

BSc. molecular biology

Department of Clinical ResearchUniversity of Southern Denmark (SDU)

Haematology-Pathology Research LaboratoryDepartment of Hematology, Odense University Hospital (OUH)

September 2021

SUPERVISORSProfessor Charlotte Guldborg Nyvold (main supervisor), OUH and SDUThomas Stauffer Larsen (Co-supervisor), OUH and SDUProfessor Torsten Haferlach (Co-supervisor), MLL - Münchner Leukämielabor

ASSESSMENT COMMITTEEProfessor Qihua Tan (Chairman), OUH and SDUProfessor Sara Ek, Lund UniversityProfessor Nils von Neuhoff, University Hospital Essen, University of Duisburg-Essen

CORRESPONDENCEMarcus Høy HansenE-mail: [email protected]: 0000-0003-3083-4850

Haematology-Pathology Research Laboratory,Department of HematologyJ. B. Winsløws Vej 15, 3rd fl.5000 Odense C, Denmark

I dedicate this work to my family;

my wife Lea and two sons, Oskar and Storm

Without hesitation, prominence is given to Mads Okkels Birk Lorenzen,wonderful friend, best man, and hematologist.

We shall laugh through it all.

Summary

MCL is a rare and malignant B-cell disease. It is often aggressive if not treated, butindolent cases do also occur. Although recent trials have shown long-term and ongoingremission following CAR T cell therapy for relapsing patients, MCL is still generally con-sidered incurable. It is commonly identified by the characteristic expression of SOX11or inter-chromosomal translocation between an immunoglobulin heavy chain gene en-hancer/promoter and Cyclin D1, leading to a constitutive and elevated expression of thecell cycle regulator protein and thus enhancing B cell proliferation. While several recurrentmutations contribute to the oncogenesis of MCL, such as ATM mutations, lesions involv-ing TP53 or the NOTCH1 pathway, the heterogeneity and small overlap between casesare striking and calls for a specialized strategy, aiming at systematic and high-resolutionpatient profiling. Furthermore, because MCL is defined by both a recurrent and hetero-geneous spectrum of genomic lesions related to its oncogenic profile, it serves as a generalmodel for improved profiling of individual cases in other B cell malignancies.

With MCL as a case model, this project aimed to investigate quality measures and signaldistribution from whole-exome sequencing (WES) to leverage its clinical applicability.Through relatively simple concepts, the NGS workflow and bioinformatics are centeredon practical molecular profiling of a few or individual patients, with the aim to supportdiagnosis and prognosis by maximizing the informational output.

The results derived from the studies showed that the number of false-positive codingsomatic mutations in diagnosis-relapsed paired MCL efficiently decreased an order ofmagnitude by intersecting mutations present in its expressed counterpart, the mRNA,compared to the DNA alone. Thus, together with comprehensive annotation from clini-cally relevant databases, the cell purification and intersection of DNA and RNA provideda powerful tool for eliminating false-positive mutations.

1

While the sequencing reads of WES are not randomly distributed across the codinggenome, the heterozygous variant allele frequencies of high-quality samples were approxi-mately Gaussian. Therefore, it is found that the standard deviation provides an effectivequality measure and can be employed in estimating the low burden of allelic imbalances.Furthermore, the developed methods for detecting copy-number alterations (CNA) showthat the genomic juxtaposition of variants allele frequencies with relative gene-wise cover-ages provides efficient means for resolving acquired copy-neutral loss-of-heterozygosities.In other cases of MCL, with relapse in the central nervous system, I demonstrate thatwhile sequencing batch effects are detrimental for detecting CNA, alternative techniquesmay circumvent such issues.

Through mathematical derivations, I present that detecting minute measurable disease byclonal rearrangements of immunoglobulin genes or somatic mutations is partly a stochasticproblem. Thus, this approach helps explain why a potentially substantial amount of high-quality DNA is required. Essentially, the amount of DNA or sequencing reads requiredfor the confident detection of residual disease must generally exceed the aimed sensitivitylevel by several factors.

In conclusion, capture-based targeted sequencing of the coding genome is a powerful andconsolidating tool for direct clinical implementation if applied critically. The inherentnoise of WES demands high-quality samples and a high depth of coverage to be trulyuseful. Even a very high sequencing depth of targeted sequences will contain technicalbias and stochastic fluctuations.

Marcus Høy HansenHaematology-Pathology Research LaboratoryDepartment of Haematology, Odense University Hospital

2

Acknowledgements

First of all, I wish to thank professor Charlotte Guldborg Nyvold for longstanding supervision andcollegial friendship and my colleagues Oriane, Simone, Sólja, Mia, Sara, Lea, Tine, Marie, andElisabeth. It is difficult to express how much I have enjoyed working with you, Charlotte. And,Oriane, we all owe much gratitude to you for your hard work and skills. I have been supportedand encouraged by my co-supervisors, Thomas Stauffer Larsen (OUH) and Torsten Haferlach(MLL Münchner Leukämielabor), and Jacob Haaber Christensen deserves special thanks for theinterest and support of the group while carrying out own clinical work at the Department ofHaematology (OUH). Also, special thanks to Niels Abildgaard, Hanne Vestergaard, and theDepartment for providing me with the opportunity to carry out research. Please, receive mygratitude. I have financially been supported by the faculty of Health, SDU, and OUH in relationto this thesis. I also thank my longstanding scientific mentor, Professor Peter Hokland (Dept. ofclinical Medicine, AU/AUH), who has openly encouraged me in several matters. Your support,family, and friendship have truly been inspiring.

My work in the field of hematology would not have been the same without previous collaborations,fruitful discussions, ideas, comments, etc., from numerous individuals. Because of this, I wouldlike to thank co-authors and former colleagues from both Odense and Aarhus (in alphabeticalorder), Anni Aggerholm, Anita T. Simonsen, Anne S. Roug, Astrid J. Appe, Bodil L. Ander-sen, Carina A. Rosenberg, Christopher Vejgaard, Dinisha C. Pirapaharan, Dorthe Melsvik, EigilKjeldsen, Hanne Jensen, Hans B. Ommen, Hans H.N. Bentzen, Johanne M. Holst, Johannes F.Sørensen, Karin D. Brændstrup, Laura L. Herborg, Lea B. Hokland, Lene Ebbesen, Line Nederby,Mads Thomassen, Maja Ludvigsen, Marianne Agerlund, Marianne A. Bomberg, Maria Hansen,Marie Bill, Per Ishøy Nielsen, Peter L. Møller, Stephanie Kavan, Karin Meyer, Hanne Jensen,Thomas K. Kristensen, Torben A. Kruse, and many, many others. I thank Dina Mohyldeenfor several keen remarks concerning language. If anyone has been forgotten, please accept myapology.

Finally, I thank my family and Mads OB Lorenzen for their continuous support.

3

1 Dansk lægmandsresumé

Leukæmi og lymfekræft er betegnelsen for kræft i det der normalt udgørkroppens bloddannende celler, og involverer oftest celler i immunforsvaret.Lymfekræft er ondartet tumorvæv, et såkaldt lymfom, der udgår fra speci-fikke hvide blodlegemer, lymfocytterne, men tumoren er ikke nødvendigvisbegrænset til lymfeknuderne. Lymfocytterne er vigtige for dannelse af an-tistoffer i forsvaret mod sygdomsfremkaldende virus og bakterier; ondartedelymfocytter derimod er dysfunktionelle, og infiltrerer hyppigt flere steder ikroppen, såsom milt, thymus, knoglemarv, mave-tarmsystemet og endda cen-tralnervesystemet. Kræftcellerne kan ofte findes cirkulerende i blodet, oger såkaldt leukæmiske. Både ved lymfom og leukæmi dannes der et stortantal af umodne eller abnorme hvide blodlegemer, hvorved normale cellersantal og funktion nedsættes. Symptomerne er bl.a. hyppige infektioner,feber, stakåndethed, udmattelse, vægttab og øget blødningstendens. Uansettype af blodkræft findes der både langsomtvoksende, såsom kronisk lym-fatisk leukæmi, og mere aggressive sygdomme, som eksempelvis akut lymfatiskleukæmi, der uden behandling vil føre til dødelig udgang i løbet af meget korttid. De ondartede blodsygdomme er ofte kendetegnet ved udpræget forskelli-gartethed, også hos den enkelte patient da kræftcellerne hele tiden forandrersig. For at sikre en optimal behandling, er der således behov for effektivtat kunne karakterisere de ændringer i arvemassen, såkaldte mutationer, deropstår som et led i sygdommen.

Mantle celle lymfom (MCL) dækker over en relativt sjælden kræftform med stor diversitet.Selv om nogle mutationer kendes fra andre kræftformer, såsom i genet TP53, optræderen lang række af de sygdomsfremkaldende forandringer kun sjældent i arvemassen hos deenkelte patienter. Motivationerne bag bedre karakterisering af disse lymfomer er mere

4

målrettet behandling til den enkelte patient, øget livskvalitet og overlevelse. Derfor erder i de seneste år kommet større fokus på at kunne karakterisere MCL med henblik påat forstå den underliggende sygdomsbiologi igennem molekylærbiologisk forskning, og atbringe individuelt baseret præcisionsmedicin tættere på kliniske tiltag.

Den udvikling der i høj grad har gjort karakterisering af arvemassen mulig findes indenforsekventering af DNA og RNA, både fra blod- og knoglemarvsprøve eller lymfeknudebiopsi,men også mere præcist fra sorterede kræftceller, der også muliggør præcis analyse af deenkelte celler. Samtidig kan sekventering potentielt nedbringe den økonomiske byrde vedat samle flere hospitalsanalyser i én metode. I løbet af det sidste årti har sekventer-ingsmetoderne fået moment og vundet indpas; først i forskningen og senest hospitalsdiag-nostik. På trods af udviklingen inden for computerteknologien, og delvist drevet af denne,stiller sekventering af arvemassen store krav til hardware og software til analyse af storemængder af biologiske og medicinske data. Det stiller endvidere store krav til forskerenog klinikeren for at kunne vurdere og tolke kritisk på sådanne resultater.

Med denne afhandling, og medfølgende artikelsamling, forsøges der at belyse og konfronterenogle af de problemstillinger der til stadighed hindrer optimal udbytte af sekventeringtil diagnose, opsporing og behandlingsopfølgning. Studiet blev drevet af den udprægedeforskelligartethed hos blodsygdommene og et ønske om at bidrage til at tilvejebringe næstegenerations bioinformatik – i sidste ende til gavn for kræftpatienterne. Jeg gennemgåren række konkrete tiltag for potentielt at kunne forbedre diagnostik og prognosticering afpatienter med MCL, blandt andet ved at sammenligne det muterede gen med dets udtryki kræftcellerne. Herudover vises det at kvalitetssikring af sekventerede DNA-prøver bådeer vigtig, men også forholdsvis simpel, og hvordan små mængder kræftceller kan opfangesved hjælp af DNA-sekventering til overvågning af patientens sygdom. Samtidig fokusererjeg på fund og problemstillinger mere specifikt for MCL, blandt andet hvordan tilbagefaldaf kræftsygdom kan ske fra celler der er biologisk anderledes og endda tidligere udsprungetend ved selve diagnosen.

5

Contents

1 Dansk lægmandsresumé 4

2 Preface 8

3 List of included papers 10

4 Abbreviations & notations 11

5 Motivations & aims 12

I Background informationand main findings 14

6 Introduction and background 156.1 Enter next-generation sequencing . . . . . . . . . . . . . . . . . . . . . . . 176.2 Mantle cell lymphomas as a diverse case model of malignant lymphoprolif-

erative diseases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196.3 The transition from cytogenetics to cytogenomics . . . . . . . . . . . . . . 22

7 Contextual summary of results 257.1 Paper I Contributions and weaknesses of WES . . . . . . . . . . . . . . 267.2 Paper II High-res assay by cell sorted DNA-RNA . . . . . . . . . . . . 287.3 Paper III Copy altering and copy-neutral lesions . . . . . . . . . . . . . 357.4 Paper IV Circumventing batch effects and lower sequencing quality in

cases involving CNS relapse . . . . . . . . . . . . . . . . . . . . . . . . . . 407.5 Paper V Deduction of the stochastic processes underlying detection of

measurable residual disease . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6

7.6 Paper VI Optimization of sequence processing . . . . . . . . . . . . . . 44

II Theoretical framework & scientific discourse 45

8 NGS quality assessment 468.1 Heterozygous allele frequencies as SNR measure . . . . . . . . . . . . . . . 478.2 Effect of coverage or DNA input on resolution . . . . . . . . . . . . . . . . 478.3 The variant allele frequency distributions . . . . . . . . . . . . . . . . . . 488.4 The special case of low-frequency variants . . . . . . . . . . . . . . . . . . 53

9 Allelic imbalance in the context of chromosomal copy alterations 589.1 Mathematical considerations . . . . . . . . . . . . . . . . . . . . . . . . . 589.2 The effect of sequencing read depth . . . . . . . . . . . . . . . . . . . . . . 609.3 Allelic imbalances and sample quality . . . . . . . . . . . . . . . . . . . . 62

10 Detection of chromosomal copy number alterations by read depth ratios 6510.1 Identifying batch effects and poor quality . . . . . . . . . . . . . . . . . . 6810.2 Genome-wide CNA detection . . . . . . . . . . . . . . . . . . . . . . . . . 71

11 Integrating analysis of copy-neutral losses by the juxtaposition of vari-ant allele frequencies and read depth ratios 73

12 Discussion 8012.1 WES flexibility and limitations . . . . . . . . . . . . . . . . . . . . . . . . 8112.2 Comments on variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8212.3 The role of replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8412.4 Integrative molecular diagnostics . . . . . . . . . . . . . . . . . . . . . . . 8612.5 Copy-number calling in WES . . . . . . . . . . . . . . . . . . . . . . . . . 8912.6 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

III Appendix 92

13 Methods and materials 93

IV Thesis Papers 114

7

2 Preface

When studying the hematologic malignancies, which primarily comprise cancer of theblood, bone marrow, and lymph nodes, one eventually acquires a language, a method-ological approach, and scientific style, which are quite distinct. I plunged into the field ofhematology more than a decade ago and, while I began to grasp the complexity of thisfield, it became more apparent how much still needs to be uncovered and how little oneknows. Despite this, the scientifically collected amount of knowledge relating to the bloodand immune system is marvelous.

Being a researcher embarking on investigations that may lead nowhere requires somenaivety and optimism. When I begin research in new areas or write a new manuscript,I am often blissfully unaware of shortcomings. I feel that this Aristotelian or Lockianfreedom to start with a clean slate is one of the greatest tools of the novice; unconventionalthinking, perseverance, and imagination are other such virtues. You will often find thatyou are not the first to ponder the idea, let alone to publish it. Moreover, good researchdoes not necessarily become good science.

One may tend, intend or pretend to believe that research and science are purely separatedfrom a cultural setting; it is not. The vast number of publications relating to NGS has beenborne both by novelty and great possibilities. The field of next-generation sequencing alsocontinuously shapes and reshapes the research culture, bringing new concepts and terms,such as “the genomic landscape” of specific cancer forms. Deeply rooted terms, such aHodgkin and non-Hodgkin lymphoma, become odd in a genomic age, and the traditionaldistinction or border between malignant myeloid and lymphoproliferative disorders is insome ways being broken and rebuilt, let alone the term “malignant”. Much has been donein a relatively short time span. Apart from the intersection between biology, engineering,and medicine, I am intrigued by science in a historical context. Personally, I find the

8

history of sequencing a fascinating story. It contains scientific rivalry and admiration, theinvention of state-of-the-art technologies, and mind-blowing achievements. It is difficultto estimate precisely, but the total cost of the “finished” human genome by the HumanGenome Project could be anywhere between 500 million to 1 billion USD, according tothe National Human Genome Research Institute (NIH, Bethesda, MD, USA). Others putthis figure thrice as high.

The papers and the illustrative references included in the dissertation represent only a partof the research I have been occupied with. It would not have amounted to much withoutcollaboration with highly skilled and talented colleagues and researchers. Although myaffiliations concerning the project and the six manuscripts that form the foundation ofthe thesis have been carried out during my fellowship at the University of SouthernDenmark (SDU) and Odense University Hospital (OUH), substantial work leading to thisparticular form of presentation has been carried out at two different institutions, both atthe respective Departments of Hematology (OUH and Aarhus University Hospital).

In the last couple of years, I have put much effort into conveying the importance ofcharacterizing the quality of the sequencing data before any biological assertions can bemade. Surprisingly, this has not been straightforward. Apart from technical aspects, muchof my research has been centered on the molecular portraits of hematologic malignancies inthe attempt to establish workflows for more precise diagnoses. Through this, I have beenoccupied by any form of evidence supporting the states of differentiation or indicationsof some precursor capacity in driving a malignancy. I hope to allocate more attentionto this particular theme in my postdoctoral research. While I am often referred to as abioinformatician, I am foremost a biomedical engineer rooted in molecular biology witha keen interest in hematologic malignancies. Naturally, the thesis will reflect this.

As NGS analysis is an extensive topic, I will focus on two main subjects in this thesis,supported by computational theory and suggestions for optimizations. One subject isthe detection of somatic lesions in hematologic cancer forms by discrete changes in thecomposition of bases in the coding regions of a gene and the relative frequency of thesechanges; the other is the subtle change in the relative amount of genomic material. Theseanalytical topics converge to support observations of potential clinical relevance for diag-nosis, prognosis, and detection of residual cancer in the blood, bone marrow, or lymphnode. Finally, the importance of combining genomic variant alleles and copy number ofchromosomal material is underlined by the particular case of copy-neutral events. Thenovelty, if any, lies in particular in the way the NGS analysis is approached.

9

3 List of included papers

I) A decade with whole exome sequencing inhaematology

II) Molecular characterization of sorted malig-nant B cells from patients clinically identifiedwith mantle cell lymphoma

III) CNAplot – software for visual inspection ofchromosomal copy number alteration in cancerusing juxtaposed sequencing read depth ratiosand variant allele frequencies

IV) Distal chromosome 1q aberrations and ini-tial response to ibrutinib in central nervous sys-tem relapsed mantle cell lymphoma

V) Perspective: sensitive detection of residuallymphoproliferative disease by NGS and clonalrearrangements – how low can you go?

VI) Workstation benchmark of Spark Capa-ble Genome Analysis ToolKit 4 Variant Calling(preprint)

10

4 Abbreviations & notations

AML Acute myeloid leukemiaCNA Copy-number alterationCN-LoH Copy-neutral loss of heterozygosityCNV Copy number variationarrayCGH Comparative genomic hybridization arrayDNA Deoxyribonucleic acidKDE Kernel density estimationMCL Mantle cell lymphomaMRD Measurable/Minimal residual diseaseNGS Next-generation sequencingNHL Non-Hodgkin lymphomaRNA Ribonucleic AcidSNP Single nucleotide polymorphismVAF Variant allele frequencyT-ALL T-cell acute lymphoblastic leukaemiaWES Whole-exome sequencingWGS Whole-genome sequencing

* First authorship or co-authorship

11

5 Motivations & aims

Next-generation sequencing (NGS) has entered the scene in hematology, forwhich the technology potentially integrates several laboratory assays into asingle entity. Within the research of hematologic malignancies, studies havemade inquiries into the genomes of dozens, hundreds, or even many thousandsof individuals. Despite this, research has until recently relied extensively onmicroarrays, implemented genome-wide association studies (GWAS), or inves-tigated recurrent mutations using large cohorts of patients. As emphasized byPetersen et al. (2017), looking at opportunities and challenges in NGS, GWAShas been instrumental in the mapping of genotypes concerning disease-causingsingle-nucleotide variants as exemplified by Speedy et al. (2014) on chroniclymphocytic leukemia (CLL) susceptibility. However, such studies are inher-ently biased towards more common variants (Petersen et al., 2017), are beyondreach for many researchers because of the large cohorts required, and are notsuited for the sporadic lesions in cancers, where many somatic mutations arevery rare in the general population. A direct consequence of the formidableundertakings has been a literature focus that still poorly translates into directclinical utilization with the individual patients in mind. Thus, methods forimproved diagnostic profiling are needed, and bioinformatics plays a centralpart in this integration.

NGS, historically also referred to as massive parallel sequencing or high-throughput se-quencing, found its way into hematology with the complete sequencing of a single acutemyeloid leukemia (AML) struck genome more than a decade ago. NGS is in many ways re-garded as a labor-intensive laboratory modality. It still requires technical specialists and,while whole-genome sequencing has passed the $1000 limit and continues to approach the

12

$100 genome, the cost concerning laboratory and computational workflow may effectivelybe manyfold higher. In addition, proper handling and purification of biological samplesand data processing are not trivial tasks.

The complex heterogeneity of mantle cell lymphoma (MCL) (see Royo et al. (2011)) callsfor the continuous development of methods for improved profiling of the individual cases.Currently, one obvious path to more consistent and comprehensive profiling is to imple-ment NGS. Thus, in this project, foremost using MCL as the impetus for developing effi-cient analytical workflows, the aim was to investigate and fortify the role of whole-exomesequencing (WES) as a clinically applicable tool for diagnostic and prognostic purposes.Thus, I sought to leverage NGS bioinformatics to data-driven applicable workflows andnot merely characterize data by predefined algorithms. While confronting this problem, Ifurthermore wanted to investigate the biological heterogeneity of individual cases of MCLusing sorted cells in the lens of potential clinical applications. The attention was keenlydirected against the detection of known and potential molecular markers in relation todiagnosis and prognosis. Lastly, I intended to streamline the bioinformatics workflows torun NGS analyses efficiently on local workstations.

The objectives of the collective study in targeted sequencing were foremost:1) Thematically reviewing the current status and past usages of WES in the field ofmalignant hematology. 2) Optimized workflows for the detection of somatic driver mu-tations at the patient level potentially applicable for improved diagnostics and prognos-tics. 3) Investigating subclonal involvement by WES. 4) Identifying and characterizinguniversally applicable quality metrics, such as the signal-to-noise ratio. 5) Evaluatingcopy-number detection from statistical methods, including copy-neutral loss of heterozy-gosity. 6) Developing sequencing annotation methods. 7) Critical assessment of mea-surable residual disease (MRD) by NGS. 8) Hardware/software performance and speedoptimization.

The importance of the project stems from the fact that bioinformatics used in NGS isstill underdeveloped in specific areas; this calls for a next-generation bioinformatics suitedfor the massive amount of data and continuously developing knowledge of cancer and itscomplex molecular networks. In time, NGS will supersede several current modalities toincrease informational content from laboratory analysis to the benefit of patients anddoctors directly in clinical usage with a potential to decrease the economic burden of thehealth care system. But, on the other hand, NGS contributes to lay the foundation forfuture precision medicine, which in turn may increase the cost substantially.

13

Part I

Background information

and main findings

14

6 Introduction and background

The hematopoietic system is a highly specialized and adapting part of thehuman body. The seemingly frantic proliferation in the renewal of the bloodcells in the bone marrow, lymph nodes, thymus, etc., is coordinated in a tightinterplay between internal and external stimuli. Aptly exemplified, the redblood cells alone have a lifespan of approximately four months, which meansthat replenishment of erythrocytes is provided by the generation of billionsevery single day throughout the life of a human being. This demand is metby the capacity to adjust the total number of cells to ambient oxygen levels,both as a short-term response and for long-term adaptation, and the stagger-ing turnover of cells is maintained by stem cells for the organism to remainin homeostasis (Bernitz et al., 2016). Although it is now known that theseself-renewing stem cells undergo senescence (de Haan and Lazare, 2018), ithas been suggested that the reservoir of hematopoietic stem cells is not dimin-ishing with age but rather increasing (Bernitz et al., 2016). This knowledge,and the fact that clonal hematopoiesis of indeterminate potential is a commonphenomenon in the aging population, helps explain the development of bothmyeloid and lymphoproliferative disorders.

Another example of the diverse biological functions supplied by the hematopoietic stemcells is the highly specialized clonal expansion of B cells. This expansion arises from theactivation of naive lymphocytes, by antigen recognition and differentiation, as a part ofthe humoral response of the adaptive immune system. The tightly controlled proliferationmay eventually lead to a large number of antibodies targeting the antigen, e.g., of viral orbacterial origin, or in some cases even the organism itself. Derailing this intricate systemmay lead to severe defects, both acute and chronic.

15

Within the classical hierarchy of hematopoiesis, the multipotent hematopoietic stem cellprovides a clear-cut definition of lineage-committed myeloid and lymphoid progenitorcells. However, the transition into the genomic era has shown that such a model is toosimplified. Until quite recently, the capacity of malignancies to seemingly cross lineagespecificity has been a conundrum. While still not fully understood, the realization of amultipotent pre-malignant or malignant stem cell helped to shed light on cases of bilinearleukemias, mixed phenotypes, or relapses crossing lineages. Parallel to the introduction ofNGS in the field of hematology, a new diagnostic entity, describing the multipotency or atleast phenotypic traits hereof, was established: Early T-cell precursor acute lymphoblasticleukemia/lymphoma, a rare type of malignancy, was assumed to constitute a few percentof the acute lymphoblastic leukemias and displays a very immature or incomplete T-celldifferentiation.

Under normal circumstances, the hematopoietic stem cells give rise to multipotent cells,which in the case of B or T cells and natural killer cells are the common lymphoid progen-itors (CLP). Early cells of the T lineage migrate from the bone marrow to the thymus asimmature thymocytes, thereby differentiating and losing the capacity to become myeloidand B-lymphocytes. However, seemingly, the myeloid potential is sustained for a moreextended span of differentiation than previously thought (Luc et al., 2012). Thus, themalignant cells may in certain cases carry a phenotype, which reflects both myeloid andlymphoid surface and intracellular markers according to the maturation stage in addi-tion to frequent myeloid somatic mutations, such as in FLT3 and IDH1/2 (Arber et al.,2016).

In contrast to this very immature lymphoid disorder, mantle cell lymphoma is most oftencharacterized as a mature lymphoid malignancy, as will be described in details. As withthymocytes, the CLP also gives rise to immature B cell progenitors, the pro-B and pre-Bcells, within the bone marrow. Here, the cells start to undergo somatic recombination ofthe immunoglobulin genes, whereas rearrangement of the T-cell receptor loci occurs inthe thymus.

16

6.1 Enter next-generation sequencing

The introduction of NGS has had an immense impact on understanding the genomic le-sions leading to malignant myeloid and lymphoproliferative disorders. Now the techniqueinnervates the diagnostic examination and clinical decision making at the departmentsof hematology. More efficient sequencing platforms are continuously being developed to-gether with computational pipelines to process massive amounts of sequencing data andinterpret causal genomic variants. Through the developments in the field and the generaladvances in hardware components, bioinformatics has become a broad discipline, bothin basic and applied science. It is interdisciplinary and draws upon many research ar-eas, such as biology, computer science, medicine, mathematics, and statistics. Currently,the term is associated with its application in NGS analyses, but it may be applied inany other field and often finds its usage in medical science generating big data. Appliedbioinformatics is perhaps aptly described as being an engineering approach through itsproblem-solving iterations. It has become a scientific discipline on its own, with manyproblems still to be solved, such as reducing the error rate in DNA sequencing.

False-positive variants can be detrimental for the interpretation of sequencing results,but it is still an inescapable feature of all NGS assays. With an error rate of 0.1%throughout the human genome of approximately 3 billion base pairs, sequenced to anaverage depth of coverage of 30x, erroneous base calls, polymerase base substitution errors,cytosine deamination (Chen et al., 2014), or computationally introduced misalignment willtheoretically reach almost one-hundred million. While impossible to eradicate, severalstrategies exist to circumvent the high error rate; one of the simplest measures beingsequencing biological replicates (Robasky et al., 2014). As costs continue to decrease(Fig. 6.1) this may become more widespread. A different approach to mitigating errorsis to perform full-transcript sequencing of the coding RNA for correlation (Fig. 6.2).Apart from few reports (Rokita et al., 2019), preferentially in papers with a technical orcomputational focus (Ku et al., 2012), this strategy has received little attention.

While the hemodiagnostic laboratories broadly adopted targeted panel sequencing, oftenamplicon-based, whole-exome sequencing (WES) has still not reached status as a routinediagnostic tool in many clinical laboratories at the departments. Moreover, with thefalling prices, it may be superseded by WGS provided by core centers and thus never findbroad usage. However, the amount of information and knowledge gained from WES, andNGS in general, have been unprecedented.

17

Figure 6.1: Decrease in costs related to the whole-genome sequencing sincethe first draft of the human genome. Prices are estimated in USD (Modified fromWetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome SequencingProgram (GSP) available at www.genome.gov/sequencingcostsdata. Accessed July 2021).

Figure 6.2: Correlation of DNA and full-transcript mRNA may be implementedas a simple approach to remove false-positive and possibly irrelevant muta-tions. One caveat pertains to the possibility of nonsense mutations not being transcribed,which in itself may be informative. In addition, it adds the transcriptional gene profile,which holds potential for diagnostics or prognostics.

18

In this study, the transition from conventional molecular diagnostics and cytogenetics isforemost investigated through targeted sequencing, primarily WES, of a heterogeneous Bcell malignancy: mantle cell lymphoma. In addition, it is put into the perspective of othermalignancies, when relevant, as my general research area over the years has covered bothlymphoid and myeloid disorders. Also, I will very briefly touch upon evidence supportingpre-diagnostic lesions and thus question the term mature B cell neoplasm in the settingof MCL, although this is a theme of currently ongoing and post-doctoral research.

6.2 Mantle cell lymphomas as a diverse case model of

malignant lymphoproliferative diseases

Mantle cell lymphoma (MCL) is defined as a rare and mature malignant lymphoprolif-erative disorder. It is clinically classified as non-Hodgkin lymphoma (NHL), a group ofcancers with an incidence of approximately 1,200 new cases every year in Denmark, andwhich comprises about 3% of all cancers in the Nordic countries (NORDCAN) (Engholmet al., 2010). As of 2010, the prevalence for MCL alone was 1.27 per 100,000 in Denmark(Abrahamsson et al., 2014), i.e., roughly 70 cases annually, and thus corresponding toapproximately 6% of the total cases of NHL. This frequency is similar to that reportedin other Western countries. Several studies have identified that the incidence rate ofNHL is increasing (Fisher and Fisher, 2004; Chiu and Weisenburger, 2003). The termnon-Hodgkin lymphoma may seem rather uninformative, as it comprises more than 60different types of malignancies with a high degree of diversity and is thus an essentialentity within hematology.

According to newer studies, the overall survival has increased at least threefold for pa-tients diagnosed from the beginning of this century compared to cases diagnosed in theprevious decades (Kumar et al., 2019; Herrmann et al., 2009). The improvement is readilyseen from the doubling of survival rates between studies from 1975 to 1986 and 1996 to2004 (Herrmann et al., 2009), owing to both advanced treatments and the improved di-agnostic techniques. Despite these advances, MCL is generally considered incurable, andthe progression-free survival drops markedly following relapses (Kumar et al., 2019). Incontrast to the grim statistics, the advent of CAR T cell therapies provides a prolongedand potentially durable treatment option for relapsed, and refractory MCL patients inparticular, as demonstrated by the ZUMA-2 trial (Wang et al., 2020).

19

An autologous transplant has become an option for younger patients. However, themedian onset of MCL is 60 years, making this treatment beyond reach for many patients.It is often disseminated at diagnosis with a leukemic presentation. Infiltration of thecentral nervous system also occurs but is infrequent (Cheah et al., 2013).

Mantle cell lymphoma is considered a malignancy of differentiated mature B cells, al-though phenotypic features are indicative of maturation stages generally ranging fromantigen-naive pre-germinal cells lacking somatic hypermutations to hypermutated or evenmemory B cells (Queiros et al., 2016). The generation of hypermutations within the lymphnode is extensive in the germinal center (Kolar et al., 2007), which is surrounded by themantle zone. This must be viewed in the light of only a smaller subset of MCL beingmarkedly hypermutated (Navarro et al., 2012; Thorselius et al., 2002). The somewhatdisputed favorable prognostic implications of somatic hypermutation (Lai et al., 2006;Schraders et al., 2009) have been more extensively explored in CLL (Gupta et al., 2020;Davi et al., 2020). While it still needs to be elucidated why an unmutated status is tiedto an adverse prognosis, it is suggested that hypermutations are potentially linked to thepresentation of lymphoma immunoglobulin derived antigens (Khodadoust et al., 2017;Xu-Monette et al., 2019). Curiously, the rearrangements in MCL are biased towards aselective usage of immunoglobulin heavy chain variable gene, as shown by Hadzidimitriouet al. (2011), Thorselius et al. (2002), and other studies.

The translocation t(11;14)(q13;q32) is considered a hallmark of MCL and is often re-garded as the primary feature of initiation. The pathobiological result is a constitutiveand elevated expression of the cell cycle regulator protein, thus enhancing B-cell prolifer-ation. CCND2 or CCND3 is often found to be upregulated when CCND1 is not involved(Salaverria et al., 2013; Wlodarska et al., 2008). Despite not solely being able to drivethe development of lymphoma in models (Lovec et al., 1994) the translocation is, in myown words, obviously tied closely to the differentiation program of the B cell, and thus itsexpressional onset. It is a principal event in initiating the malignant phenotype by repo-sitioning cycle regulator Cyclin D1 to the immunoglobulin heavy chain transcriptionalcontrol. Arguably, it partly explains the specificity and relatively mature commitment ofthe involved B cell malignancies. Indeed, overexpression of CCND1 may also be observedin prolymphocytic leukemia, splenic marginal zone lymphoma, hairy cell leukemia, CLL,multiple myeloma, plasma cell leukemia, etc. Taken into account, several other muta-tions or larger chromosomal aberrations may occur early and even in undifferentiatedstem cells. One striking example of this includes a case where the B lymphocytes werefound being Philadelphia chromosome-positive (Ph+) (Takahashi et al., 1998). Of note,

20

the speculation here is not the first to suggest that t(11;14) may arise well before themalignant expansion of MCL as a latent lesion (Royo et al., 2011), and it has also beendemonstrated in healthy individuals (Lecluse et al., 2009; Hirt et al., 2004), in the samemanner as the highly frequent t(14;18) (Schmitt et al., 2006; Dolken et al., 2008; Ji et al.,1995). Likewise, some leukemias and lymphomas have sporadic involvement of multiplelineages, as may be encountered in CML, AML, or ALL.

Mixed phenotype or bilinearity is not a feature of MCL, although a few cases concurrentwith myeloid presentation have been described (Rodler et al., 2004; Pathak et al., 2017).The work of Holst et al. (2019)* has, among others, cemented that myeloproliferativeand lymphoproliferative malignancies can occur in the same patient. In this perspective,treatment can become difficult because malignant or pre-malignant stem cells are difficultto identify and define, as discussed by Hokland et al. (2019)*.

While MCL is considered to be highly heterogeneous, both in its clinical presentation andcourse (Martin et al., 2017), it is also marked by several recurrent molecular aberrations.Although the heterogeneity cannot be dismissed, it is also a relative concept, which maynot be meaningful in the context of, e.g., melanoma, colon, or lung cancer. At the timeof diagnosis, the malignant cells often harbor many chromosomal lesions and mutations.Some of the most recurrent mutated genes include ATM and TP53 (Bea et al., 2013;Zhang et al., 2014; Wu et al., 2016; Greiner et al., 2006; Schaffner et al., 2000; Fanget al., 2003). Often, a TP53 mutation in the DNA binding domain is accompanied bya 17p deletion, or CDKN2A is deleted or mutated. Collectively, these classical tumorsuppressors are aberrant in the majority of the MCL cases; however, lack of any lesionsin these genes is frequently found in indolent cases (Royo et al., 2011). A TP53 lesion is,therefore, a predictor of an adverse prognosis even in more fit younger patients (Eskelundet al., 2017) and is often indicative of a proliferative malignancy and blastoid morphology(Slotta-Huspenina et al., 2012). As with other hematologic malignancies, the cytogeneticprofile is also of prognostic value. A complex karyotype is a very frequent finding andconfers a dismal prognosis (Khoury et al., 2003; Obr et al., 2018).

The collective combination of molecular lesions is often referred to as the mutationallandscape (Bea et al., 2013; Zhang et al., 2014; Hill et al., 2020; Ahmed et al., 2016; Yanget al., 2018). It includes CDKN2A (Wang et al., 2008; Jares and Campo, 2008; Dreylinget al., 1997; Hutter et al., 2006), NOTCH1 (Kridel et al., 2012), UBR5 (Meissner et al.,2013), etc. The most specific molecular marker of MCL is SOX11, which has been studiedextensively (Dictor et al., 2009; Mozos et al., 2009; Chen et al., 2010; Xu and Li, 2010; Bea

21

and Amador, 2017; Beekman et al., 2018) since it was first discovered to be upregulated inmantle cell lymphoma (Ek et al., 2008). This transcription factor has also been a researchfocus of Nyvold (main supervisor) et al. (Hamborg et al., 2012; Simonsen et al., 2014,2015; Nielsen et al., 2019).

Analogous to CLL, and other hematologic malignancies, MCL has a long list of less fre-quent mutations. While many of the driver mutations in its oncogenesis, such as hits inTP53 or NOTCH1 pathway (Onaindia et al., 2017), have been identified by NGS, thesmall overlap between cases is striking and calls for a different strategy, aiming at sys-tematic and high-resolution single-patient profiling. Because MCL can be described as amalignancy containing a wide spectrum of lesions ranging from very few to comprehensivemolecular lesions in the genome of the B lymphocyte, it serves as a model for improvedcharacterization of individual cases of B-cell diseases in general. Modern diagnosticsmust comprehend this spectrum to increase the efficacy in individualized treatment – theessence of precision medicine. Consequently, the importance of comprehensive molecularprofiling relates not only to MCL but other hematologic neoplasms as well.

Next-generation Sequencing and bioinformatics methodologies described here are not con-fined to a special type of cancer, although hematologic malignancies are well-suited forcell sorting before comprehensive genomic and transcriptomic analyses. Although it ispossible to sort and purify malignant cells in many cases, a central problem of hematologyrelates to the detection and characterization of low disease burden or subclones. Severalapproaches confront this problem, such as developing sensitive algorithms, genomic ampli-fication from a subset or even single cells as performed by Walter et al. (2018) for pediatricAML, MRD by digital droplet PCR, qPCR (Simonsen et al., 2014), or sequencing suchas reviewed by Chase and Armand (2018). I refer to Paper I for a general treatise onthe detection of somatic mutations and short nucleotide variants using WES.

6.3 The transition from cytogenetics to cytogenomics

Cytogenetics constitutes a pillar of hematology, and, in return, the intensive study ofmalignant blood disorders has contributed much to this specific branch of medicine. Themost prominent example is the discovery of a small chromosome in the blood cells ofpatients with chronic myeloid leukemia by Nowell and Hungerford (1960), which later ledto the suggestion by Rowley (1973) that the long arms of chromosomes 9 and 22 were theconstituents of this acquired genomic feature.

22

Chromosomal instability and cellular aneuploidy is a common feature of solid tumorsand hematologic malignancies. Several recurrent aberrations involving copy changes arerecognized as important diagnostic or prognostic features, such as with monosomy 7 or7q deletions in myeloid neoplasms and 3q or 8q gains in subtypes of lymphoid neoplasms.While genome-wide genotyping and genomic hybridization microarray analyses have beenimplemented as diagnostic tools in clinical laboratories, the usage of NGS for detectingstructural variants, such as acquired copy-number alteration (CNA), has not yet reachedfull maturation, and its robustness and algorithms for routine clinical application is stillbeing evaluated.

While the costly sequencing technology was already in place to resolve chromosomal copyalterations in 2013 with the study of 200 de novo AML (Cancer Genome Atlas Researchet al., 2013), copy-number variant (CNV) calling was performed with the accepted tech-nical standard at the time, which relied on the microarray platform. This cardinal papersupplemented that two-thirds of the patients harbored detectable alterations, with a me-dian of only one alteration per genome, and concordance between the results and riskstatus from conventional cytogenetics. In contrast, patients categorized diagnosticallywith unfavorable risk cytogenetics had a median of six copy events. While not relyingon NGS, the study implemented a quite recent technological breakthrough: Followingthe GeneChip Human Mapping 500K Array Set (Affymetrix), which could genotypemore than 500,000 single-nucleotide polymorphisms (SNPs) in a single experiment in2005 (white paper, Affymetrix, Part No. 702087 Rev. 4), the genome-wide SNP arraysproved to combine detection of unbalanced chromosomal DNA and polymorphisms. TheAffymetrix Genome-Wide Human SNP Array 6.0 (Applied Biosystems, Thermo FisherScientific, Waltham, Massachusetts, USA), used in the mentioned study, contained almosttwo million probes distributed across the genome, with the probes being approximatelyevenly shared between copy number probes targeting non-polymorphic DNA sequencesand nucleotide polymorphisms. In comparison, each individual human genome containsseveral million SNPs (Sachidanandam et al., 2001), while the human population proba-bly contains more than 100 million variants (Karczewski et al., 2020; Genomes Projectet al., 2015) – far more than initially expected (Kruglyak and Nickerson, 2001). Thismeans that the usage of arrays, although highly practical, is limited to the most commonvariants. Just covering the coding variants may require several million probes. Empiri-cally, partially based on the research presented here, exome sequencing reveals between15,000–20,000 high-quality variants per sample, varying slightly according to the qualitymeasures used, the number of total sequencing reads, and the library preparation kit.

23

More importantly, the resolution of whole-genome sequencing is not physically limitedby the single-stranded DNA probes on the microarray chip, whereas in-solution target-capture of coding regions for WES is an analogous limiting factor; however, not limitedby the total number of variants. Consequently, the methodological diversity and flexibil-ity, both in terms of included genomic regions and depth of coverage, positioned NGS tosupersede the microarray platform. As the price for even a whole genome has fallen below$1000, several issues are still pertinent to clinical implementation, namely maturation ofbioinformatics for practical and confident detection of structural variants.

It is only possible to address a small fraction of the current practical problems in NGSand bioinformatics in this thesis. Because the center of attention is the empirical data,I dub these efforts next-generation bioinformatics, where I attempt to push some of theadvances in NGS and bioinformatics towards a patient-oriented clinical application withinhematology.

24

7 Contextual summary of results

The results presented in this chapter are directly based on the included pa-pers in this dissertation (Paper I–VI), which has been carried out during theproject period. However, the methodologies and general knowledge on NGSbioinformatics and data analysis concerning hematologic malignancies are alsobased on papers published prior and partly during the commencement of thegraduate program at the University of Southern Denmark and is cited withan asterisk (*). Parts of the collective work have been carried out while beingaffiliated with two separate institutions. Additionally, some of the researchis overlapping or iterative and cannot, in my view, be separated in this over-all attempt and motivation to develop clinically applicable next-generationbioinformatics relating hematologic malignancies. The term itself is intendedto raise the awareness that NGS-based cancer diagnostics demands an inter-disciplinary and analytical engineering approach to sequencing bioinformaticsto be truly successful. As such, a “clinical bioinformatician” must be ableto guide doctors and researchers in how the highest practical resolution ofthe NGS assay can be achieved – starting from obtaining suitable biologicalstarting material to the computational procedures and annotation.

A summary of the important and direct findings from the papers is included here, in somecases put in the perspective of other published papers or manuscripts (*). Consequently,some contextual background, or discussion in relation to general subjects, is intended.The more general or conceptual findings will be covered in the theoretical frameworkchapters. Thus, I direct the attention to Part II for a description of computationalproblems.

25

7.1 Paper I Contributions and weaknesses of WES

The first paper (Paper I) provides a critical overview and qualitative analysis on howexome sequencing contributed to the field of malignant hematology in its first decade ofexistence. While the technique has provided an unprecedented impetus to research andspurred large-scale investigations into the biology of the cancer genomes of leukemia andlymphoma, it has been hesitant to find its way into the routine clinical laboratory. Iargue – which was somewhat speculative at the time of writing the manuscript – thatwhile exome sequencing is now ripe for implementation in the clinical laboratory, it isbeing superseded by whole-genome sequencing (WGS). Several reasons are identified forthis transition. First of all, the price of WGS is decreasing at a relatively more rapid pacethan WES. This discrepancy is partly caused by the proprietary kits of the latter, whichhas enabled the focused and less expensive interrogation of the coding cancer genomein the first place (Hokland et al., 2016)*. Secondly, its laboratory procedure containsadditional steps involving hybridization-based target enrichment (Fig. 7.1) and, finally,the error rate is generally higher than that of WGS (Belkadi et al., 2015; Braggio et al.,2013; Lelieveld et al., 2015; Meynert et al., 2014, 2013; Parla et al., 2011; Warr et al.,2015). As can be empirically demonstrated, the performance and output of the differentkits, such as SureSelect (Agilent, Santa Clara, CA, USA), Nextera/TruSeq (Illumina,San Diego, CA, USA), or SeqCap EZ MedExome (Roche, Basel, Switzerland), are notconcordant in terms of capture efficiencies and coverage profiles.

Furthermore, exome-capture may be biased towards reference sequences, and base substi-tutions may occur during extensive PCR amplification, all contributing to a decrease inthe sequencing quality. The most extreme differences in coverage uniformity are, of course,found when comparing kits involving hybridization capture versus amplicon enrichment(Samorodnitsky et al., 2015). It is also argued that many of the WGS projects are oftenpseudo-exome sequencing, heavily biased towards the coding parts of the genome. This isexemplified by the first sequencing of a complete cancer genome in 2008 involving a caseof AML (Ley et al., 2008).

It is evaluated that when it comes to research, hematological malignancies and pre-malignancies have been positioned favorably, when compared to many other cancer types,because of relative ease of sampling. This attribute has made extensive studies possible,such as in the investigation of clonal hematopoiesis of indeterminate potential, a prevalentfinding in the population of individuals above 65 years of age (Genovese et al., 2014). This

26

Figure 7.1: Schematic overview of the laboratory workflow in WES (Paper I)..Typically, the steps involve the extraction of a sufficient amount of DNA (A), fragmenta-tion of DNA can be mechanical or transposon-mediated followed by the capture of exonsequences (B), and PCR amplification followed by short-read sequencing (C), i.e., Illu-mina’s sequencing by synthesis or Ion Torrent semiconductor sequencing (Thermo FisherScientific).

27

and similar studies have pushed the exome sequencing assays below the practical limitsof detection, exemplified by the study by Jaiswal et al. (2014) with a lower threshold at3.5% variant allele frequency (VAF), which is just above the conceptual definition of sucha mutation (> 2%) (Steensma et al., 2015). The convenience of implementing WES owesto its breadth of coverage of the human coding genome, and this is thus prioritized to itsinherent high error rate at low allelic fractions.

It was also recognized that the extensive sequencing studies involving WES, with a pre-ponderance of the myeloid malignancies in the first years, did not pave its way towardsintroduction into routine diagnostics. Instead, targeted panel sequencing has to a broadextent, and with success, relied on results from these early studies.

Because of these and other considerations, I argue that the history and knowledge pre-sented in this introductory paper (Paper I) are essential for understanding the strengthsand caveats of WES, and to provide a more nuanced discourse of this innovation. Con-cerns relating to the clinical applicability and standardization of WES are not new,as the brief annotation “Can exome scans be expected to be part of real-time decision-making in patients with haematological cancers?” (Hokland et al., 2016)* also emphasized.New research methodologies and concepts have followed the continuous development andwidespread usage of NGS, in addition to new ways of visualizing and presenting large-scale data as well as changes in the scientific language in relation to sequencing projects.Hence, although the most obvious additions to the scientific literature have been of tech-nical and biomedical relevance, a more subtle transition has been ongoing concerning thechanging research culture and its jargon.

7.2 Paper II High-res assay by cell sorted DNA-RNA

In several ways, Paper II represents the accumulation of knowledge from several yearsof research, which enabled us to portrait a high-resolution molecular picture of individ-ual cancer patients relying on NGS. Methodologically, it draws on the techniques chieflydeveloped for profiling a case of a young adult harboring an aggressive immature earlyT-cell precursor acute lymphoblastic leukemia (ETP-ALL) (Hansen et al., 2018)*, thecombination of DNA-RNA in a small cohort of AML patients with normal karyotype(Hansen et al., 2016)*, and the use of exhaustive annotations in investigating twins withmonoclonal B lymphocytosis and CLL (Hansen et al., 2015b,a)*. The previous methodswere streamlined (Fig. 7.2) and implemented in order to demonstrate that cell purifica-

28

Figure 7.2: Graphical representation of the workflow employed in Paper II.

tion and optimized bioinformatics help to portrait the cancer of individual MCL patientsin a more precise manner from a clinically relevant perspective. We set to characterize therelapse molecularly, while screening for potential biomarkers. Although strictly speakingbeing a methods article, it contains additional findings in relation to the biology.

In brief, the results of Paper II show that cell sorting in combination with DNA andRNA sequencing mediates high specificity in detecting driver mutations with a median of17 expressed somatic mutations at diagnosis and relapse (7–24, see supplement Table E1).Annotation was streamlined using the developed software VariFant (unpublished), whichgathers information from multiple databases, such as Catalogue of Somatic Mutations inCancer (COSMIC), PubMed, ClinVar, dbSNP, UniProt, etc. (see example in Table 7.1).The sequencing analyses were based on an average of 164 million and 94 million reads persample from whole-exome and RNA sequencing, respectively, comprising 25 sequencedsamples in total. A very small number of potential clinically actionable mutations or mu-tations in relevant focus genes were found (2–5 mutations, Fig. 7.3), with two of thesebeing lost between diagnosis and relapse (BCLAF p.K178fs and TRAF3 p.G123D, Patient1 and 2, respectively) and a single newcomer (MYD88 p.S219C, Patient 4). The expres-sion patterns of these high-quality sets, drawn at diagnosis and relapse, relative to that ofdonor B-cell control samples, immediately suggested specific expressional biomarkers andmutations relevant for the characterization of the individual lymphomas; One examplebeing the observation of a non-synonymous MYD88 in combination with SPEN nonsensemutation and a chromosome 7q31–32 deletion, not frequently associated with MCL. Notsurprisingly, all four patients had at least one potential driver with a direct or an indirectrole in the NFB or Notch signaling pathways, e.g., TRAF3 and BCLAF (Nagel et al.,2014; Sun, 2012; Zhang et al., 2015; Bushell et al., 2015; Shao et al., 2016).

29

Figure 7.3: The somatic mutations of Paper II were detected from intersecting DNA andRNA and evaluated against public databases (PubMed, COSMIC, dbSNP, and ClinVar,etc.), using the developed annotation tool Varifant (unpublished) to identify potentialdriver mutations of clinical relevance.

30

Table 7.1. Example of annotation of somatic mutation using the developedcross-platform graphical user-interface tool Varifant (unpublished). The mutationin TP53 was accompanied by a deletion of chromosome 17p, and is highlyrecurrent in more aggressive cases of MCL.

Symbol TP53CHROM chr17POS 7578406 (GRCh37)Type Single nucleotide variantREF CALT TZygosity HomozygousCount 66Coverage 68Frequency 97%Consequence missense variantEMAF NoneAmino Acids Position R175HCDS mutation c.524G>AMean Polyphen 0.33Mean Sift 0.08Range Polyphen 0� 0.95Range Sift 0-0.11Max Polyphen 0.66Max Sift 0.11PubMed leukemia 775PubMed lymphoma 294DisGeNET lymph 62UNIPROT Interaction with HIPK1 and ZNF385AClinVar PathogenicCOSMIC COSM10648FATHMM Prediction Pathogenic (score 0.99)

Although the cohort was too small to perform robust statistical analyses of differentialexpression, we identified multiple candidate markers. Several of the genes were found tobe potential classifiers to discern malignant B-cells from peripheral mononuclear bloodcells or CD19 positive cells (CD19+) and provided considerable potential to identifyMCL and CLL when combined, e.g., LILRA4, SYNE2, PTPRJ, and TRANK1 (Fig. 3Band D, suppl. E3 in paper). These specific markers, for which a general concordancebetween paired RNA sequencing and qPCR assays was found (Fig. 3C, suppl. E2

31

in paper), were validated as being significantly altered compared to normal CD19+ cellsusing a larger cohort. Notwithstanding, these findings need to be further investigated,and co-researchers of the group are currently exploring their potential roles. Expressionanalyses also revealed that the largest difference between diagnosis and relapse, in termsof relative expression, was highly enriched in NFB pathway gene members – central toMCL and other B cell malignancies (Kennedy and Klein, 2018). In addition, a singleoutlier was identified (Patient 3, 2nd relapse) owing to inferior RNA sequencing quality(Fig. 3A in paper).

The expressional and mutational profiles supported heterogeneous inter-patient profiles ofMCL, which at the intratumoral level may be more homogeneously defined than previouslysuspected. Such a general attribute is supported by recent findings generated by ourgroup, for which malignant single cell populations of eight patients are profoundly distinctyet display little evidence of subclonal involvement (Fig. 7.4) (Hansen et al., 2021)*.Intriguingly, some of the findings in Paper II point to a relapse clone that is genomicallyprior to that of the diagnostic clone in terms of mutations (Fig. 7.3) and copy-numberchanges (Fig. 7.5, Patients 1 and 2).

32

Figure 7.4: Single-cell transcriptomes from eight patients with mantle cell lym-phoma. Heterogeneous clusters discerned the transcriptomically defined cell populationsof the patients (A), while a profoundly homogenous profile was found within the indi-vidual patients. The only clearly coinciding cell populations from multiple patients wereof SOX4+ pro- or pre-B origin, with both malignant SOX11+ and SOX11- cells (B).The single-cell expression signature in relation to commonly assessed clinical markers wasevaluated for the entire cohort (C). Unpublished data, Simone V. Hansen et al., 2021*.

33

Figure 7.5: Schematic copy-number profiles of the four mantle cell lymphomapatients included in Paper II (A) with a graphical representation of the typical outputfrom VarScan2 showing a trisomy 9 and 17p deletion without any statistical evaluation(plotted with a median filter using Mathematica, Wolfram Research, Champaign, Illinois,USA).

34

Collectively, the diagnostic and relapse samples were marked by general concordancebetween the time points, except for minor differences. In addition, one patient was struckby extensive alterations in chromosomal copy numbers at relapse, with more than half ofthe chromosome pairs being affected (n=14). The genomic instability of this patient, withan apparent cytogenetically complex karyotype at relapse, was not attributed to TP53mutations or deletions; instead, biallelic inactivation of the Ataxia Telangiectasia Mutatedgene (ATM, p.R2691P, p.L15P) may have been the primary contributing factor. Thefindings of this patient case (Patient 1) will be referred to more thoroughly to exemplifythe general detection of CNA.

7.3 Paper III Copy altering and copy-neutral lesions

Several papers have investigated the use of WES for the detection of copy number vari-ation/alterations. Often the conclusion has been that the detection is challenging due tothe introduced noise and bias or that the calling algorithms have suboptimal performance(Tan et al., 2014; Zare et al., 2017; Nam et al., 2016). Belkadi et al. (2015) concludedthat WES is not reliable for the detection of copy-number variation. Although an observedincreased noise compared to WGS is agreed upon, such a statement may be consideredan oversimplification given that multiple factors affect the final resolution, most notablyDNA quality and starting amount, fraction of cells containing the lesion, uniformity ofsequencing coverage, and the depth of coverage. Paper III extends previous studies onCNA detection (Hansen et al., 2018; Simonsen et al., 2020, 2018)* and formalizes sim-ple conceptual and statistical methods. The approach utilizes the juxtaposition of bothvariant allele frequencies and read depth ratios in an approximately one-to-one manner(Fig. 7.8), based on the estimation that one coding variant per gene is expected to bethe most frequent observation (see also Paper I and chapter 11, Part II).

Because the existing WES copy-number tools were initially evaluated to perform subop-timally, software was developed (CNAplot) and tested against VarScan2 (Koboldt et al.,2012), which does not perform any statistical evaluation. The outcome of this devel-opment is described in Paper III, which presents the new software’s additional abilityto detect copy-neutral chromosomal losses based on the previous case of ETP-ALL (seeHansen et al. (2018)*). This ability to statistically pinpoint CNAs and copy-neutral loss-of-heterozygosity (CN-LoH) fast and efficiently was tested using the paired MCL samplesof Paper II. The read-depth ratio profiles of the samples were in agreement with theplotted output from VarScan2. The concept was partially based on the investigations

35

of 38 samples (48.5–188.6 million reads each), in total, from 14 individuals with hema-tologic malignancies or disorders (Hansen et al., 2019)* (preprint)1. The heterozygousvariant allele frequencies assumed a Gaussian distribution in 805 of the 836 autosomes(96%, Anderson-Darling test, p < 0.01). In total, 69 chromosomes were CNA-afflictedwith six samples containing CN-LoH, including a case diagnosed as AML with a normalkaryotype.

In summary, the work of Paper III shows that simultaneous Pearson �2 and WilcoxonMann-Whitney U tests are statistically attractive, when juxtaposed, to detect copy-altering and copy-neutral chromosomal events. Whereas the �2 statistics help to deduceacquired allelic imbalances, the U test assesses the relative changes in read depth to thatof the control sample. The latter was based on the notion that sequencing reads in WESare not assumed to be normally distributed. The power of this technique is emphasizedby the distal part of chromosome 22 of Patient 1 relapse sample (Paper II), where acopy gain with no allelic imbalance is present (Fig. 7.6). This may arise in partial orcomplete tetrasomies where the genotype configuration can be represented as AA|BB, i.e.,four balanced copies, in comparison to the expected A|B (see Gardina et al. (2008)). Co-incidently, this aberrant chromosome also directly exemplified the test’s ability to discernCN-LoH (proximal part in figure). The implementation of these statistical measures inCNAplot was rationalized by their high performance, compared to the computationallycumbersome Fisher’s exact test, and from the argument that reads from control-pairedexome data can be evaluated as independent samples. Generally, the U test requires atleast 10–20 data points, here represented as the approximate number of genes and exonicvariants involved in the test. In contrast, a the change of a single nucleotide variant,e.g., in TP53 (Paper II+IV), may indicate a loss or gain of one allele. However, asubstantial number of false-positive single-allelic changes may occur, and the chosen sig-nificance level generally reflects these. Hence, a threshold (↵) must be selected in relationto the order of variants (n) being tested, e.g. ↵bonferroni = ↵

n 10�4, when typicallyanalyzing several hundreds to thousand variants (Fig. 7.6 and 7.7). To accommodatethe risk of false-positively called allelic imbalances, I introduced the concept of lowestcompound probability, in which stretches of multiple significant changes in VAF are eval-uated together: The reasoning behind this assessment is 1) that the probability of severalconsecutive observations occurring by chance becomes infinitesimal, and 2) it is trivial toimplement in large-scale sequencing analyses.

1Because this work was carried out during the doctoral program and part-time affiliation with twoseparate institutions, influencing the development of the described software in Paper III, the researchresults are briefly included here

36

Figure 7.6: Chromosome 22 from a sample drawn from patient with mantle celllymphoma relapse (Patient 1, Paper II), statistically evaluated and visualized usingthe software CNAplot (Paper III). The proximal part of the chromosome shows strongevidence of copy gain without allelic imbalance, i.e., AA|BB, which is only significantlydifferent in read depth (Wilcoxon Mann-Whitney U test). In contrast, the distal partof the chromosome shows a copy-neutral loss of heterozygosity, also known as acquireduniparental disomy, which holds significant changes in variant allele frequencies (Pearson�2).

37

Figure 7.7: Example CNAplot ouput. The shown figure was exported from the de-veloped CNAplot software, where 14 highly significant autosomes hold copy changes ina case of relapsed mantle cell lymphoma (Patient 2 from Paper II). The compoundprobability is given by the theoretical risk of drawing an observed number of continuoussignificantly altered variant allele frequencies by chance and independently of each other.As each p-value may be low, the compound probability can be extremely low and is astrong indicator of allelic imbalance caused by a copy-number alteration of copy-neutralloss of heterozygosity

38

Figure 7.8: CNAplot was developed to enable a direct comparison between variant al-lele frequencies and read depth ratios (Paper III). These two measures are required incombination to establish the correct copy-number state. The example here shows largecopy-loss and gain of chromosome 8, using the sequencing data from Paper II. In situ-ations where a copy-neutral loss of heterozygosity or a chromosomal gain without allelicimbalance is present, the changes are mutually exclusive (see Fig. 7.6).

39

Statistical inferences were heavily affected by large variance of allele frequencies and rela-tive coverage. The primary factors contributing to an increase in the standard deviationwere low coverage, poor sample quality or sequencing quality, all of which gave rise toboth false-positive and false-negative observations (see chapter 8). In such instances, afilter could practically be applied to the data analogous to signal processing in generalelectrical engineering, such as a median-filter (exemplified in Paper II, supplement Fig-ure E5), or could be implemented in the detection of allelic imbalance from a very fewcells (Simonsen et al., 2018)* (See Fig. 9.5).

Crudely, it was asserted that CNAs at or above one megabase at the genomic scale werereadily detected by WES, while the non-uniformity of sequencing depths in several casesmade statistical inference of CNAs in thousands or even several hundred thousand basesdifficult. The primary factor contributing to the ability to discern a copy gain or loss wassample quality and burden. Principally, it was found that batch effects were detrimentalfor detecting CNA employing read counts but not for detecting allelic imbalance.

7.4 Paper IV Circumventing batch effects and lower

sequencing quality in cases involving CNS relapse

In contrast to the high sample quality obtained by sorted malignant cells of the previ-ously described study, the resolution of samples included in Paper IV was inferior bycomparison. Technically, this study provided an unambiguous example of sequencingbatch-effects that are detrimental to the detection of CNA using changes in read depth(see chapter 10).

The manuscript is a case series reporting relapsing mantle cell lymphoma with infiltrationof the central nervous system. Three of the patients were marked by large chromosome 1qCNA of nearly 100 megabases, with losses involving almost the whole chromosomal armin two cases and a copy gain in one. The response to Ibrutinib treatment in CNS relapsewas brief in the cases harboring the 1q lesion. A fourth patient, with a double-mutatedATM gene at diagnosis, no CNA found, and intraocular involvement of corpus vitreumat relapse, attained complete remission within two months and experienced a sustainedlong-term response to Ibrutinib reaching beyond 60 months. TP53 deletion was detectedin the dorsal tumor at relapse. No malignant cells were found in the cerebrospinal fluid,and pleural effusion the only complication reported in relation to treatment. All fourcases were either struck by either ATM or TP53 lesions.

40

Figure 7.9: Evaluation of chromosome 1 in four cases of mantle cell lymphomapatients with relapse involving the central nervous system. Variant allele fre-quencies demonstrated allelic imbalance on the large part of the q-arm of Patients 1–3.As the type of copy-number alteration could be deduced in Patients 2 and 3 from variantallele frequencies alone, it was supported by comparing read depth ratios. The standardapproach is to implement a tumor/control pair from the same patient. However, in thisstudy, it was not possible due to batch effect and suboptimal lower quality. Thus, thecoverage ratio between the p- and q-arm compared to the expected control sample ratiowas implemented to indicate whether a gain or loss had occurred (B, gray). As severalof the sequenced samples did not harbor any chromosomal imbalances (B, white), thesecontrols were used for direct comparison.

In this study, a slightly different approach to that of Paper II–III was needed in orderto confidently determine the chromosomal copy number state. While it was clear thatsignificant allelic imbalance on 1q could be detected (p 0.05, Mann-Whitney U test,Fig. 7.9A), the previously used methods to detect a change in coverage failed because ofnoise. The results showed that batch effects are detrimental to the detection of somaticCNA by means of change in regional coverage. Instead, it was found that the ratio of thetotal reads obtained internally from chromosome 1p to that of 1q (intrasample), normal-ized to the paired control sample (intersample), provided a robust measure (Fig. 7.9B).This ratio was highly concordant for controls, whereas evidence of deletion was supportedin two of the case samples (Patients 1 and 2) and a gain in the third patient, while nochange was observed for Patient 4. The aberrations were confirmed by qPCR.

Although aberrations of chromosome 1 seem to be rare in MCL, the potential contributionto the development of a CNS infiltrating phenotype needs to be investigated. Currently,there is no other evidence pointing to such a role. The seeming rarity must be evaluatedfrom the knowledge that MCL genomes are often found to have a complex karyotype. Forfurther details pertaining to the detection of CNA, see Part II.

41

7.5 Paper V Deduction of the stochastic processes un-

derlying detection of measurable residual disease

The last study pondered the conceptual problem of tracing minute MRD levels in clonallymphoproliferative disorders based on immunoglobulin heavy chain (IgH) gene rearrange-ments using ultra-deep sequencing. Foremost, it was found that several of the publishedstudies did not use a sufficient amount of cells and DNA. Also, information on the numberof sequencing reads implemented to attain a given sensitivity level was largely absent. Itis evaluated that the lack of transparency in studies relating to MRD makes it difficultto assess the validity of the presented sensitivities.

From a statistical consideration, it is suggested that to be 95% certain of detecting, onaverage, one clonal cell in a million others, the assay must theoretically include at least 3million cells from the simplest deduction. The mathematical assumptions behind randomsampling were based on the Poisson probability distribution (Eq. 7.1), where � is theestimated mean, e.g., � = 1 (per million), and k is the discrete number of occurrences tobe estimated, i.e., 0, 1, 2, etc.

�ke��

k!(7.1)

However, the laboratory assay does not consist of a single step, but two or more subsequentsteps, e.g., extraction of cells or DNA and sequencing with a fixed number of sequencingreads. Each of these steps (n) can be considered a random subsampling of the bulk. Inthe setting of MRD, the main concerns are the probability of true positives, i.e., one ormore clonal sequencing reads (p, Eq. 7.2), and the risk of false-negative observations(q = 1� p).

p(k = 0) = (1� �ke��

k!)n = (1� e��)n (7.2)

Another source of variation arises from PCR amplification for which a basic mathematicalmodel exists (Rutledge and Cote, 2003)(Eq. 7.3). From this, it can be derived that if theamplification rate of the clonal sequence to that of the background IgH rearrangementsare not equal, the effective relative efficiency (E0) will potentially be heavily affected afterjust a few rounds of PCR (Eq. 7.4).

42

nc = n0 · (E + 1)c (7.3)

E0 =nclone · (Eclone + 1)c

nbg · (Ebg + 1)c(7.4)

In the case of a simple two-step analysis, where the subsampling can be considered to becomposed of (1) the cell/DNA extraction together with subsequent PCR amplificationand (2) sequencing, the theoretical probability of detecting a specific clonotype by oneobservation or more is then:

p0(k = 0) = (1� (E0�DNA)ke�E0�DNA

k!)⇥ (1� (E0�NGS)ke�E0�NGS

k!) (7.5)

p0(k = 0) = (1� e�E0�DNA)⇥ (1� e�E0�NGS ) (7.6)

q0 = 1� p0 (7.7)

Thus, in its simplest form, it follows that approximately 20µg of high-quality DNA,corresponding roughly to 3 million cells, is required to reach a sensitivity of 10�6 andto achieve a probability 95% of detecting a clonal cell in a million in a single-step assay.However, in the two-step model with 95–100% relative amplification efficiency, the risk ofa false-negative observation has risen to 10–13%. Conversely, the amount needed for the10�7 sensitivity level is staggering, increasing in a linear fashion.

It is also argued that technical replicates decrease the risk of assay failure and may thusbe implemented instead of increasing DNA amount or sequencing coverage. The simplereasoning behind this is that the effective assay failure rate (F 0) will decrease as a functionof the individual risk of failure (f) to the power of replicates (n): F 0 = fn.

43

7.6 Paper VI Optimization of sequence processing

Paper VI (preprint) investigates the performance of newer algorithms suited for hyper-threaded and parallel data processing concerning NGS bioinformatics. It shows that tech-niques developed for high-performance computing clusters are also suitable for modernCPU architectures, which may even outperform clusters in terms of speed and practicalworkflows. For example, marking and removing sequenced PCR duplicates in the stableparallelized GATK4 Spark version was thrice as fast as the non-Spark MarkDuplicates.Collectively, the more than 10-fold increase in speed of base quality score recalibrationand variant calling provides a means to potentially deliver the output of more than 24exomes/day using a modest 24 logical-core CPU, from alignment to called variants, with-out exhausting the read/write capacity of contemporary solid-state disk drives (NVMe).Interestingly, the Spark-enabled tool tested on a local workstation was faster for WESthan a 72 virtual CPU cluster, using the Google Cloud Platform.

44

Part II

Theoretical framework &

scientific discourse

45

8 NGS quality assessment

This chapter demonstrates that the error rate generally increases for tar-geted sequencing as the variant read-depth decreases. Both amount or qual-ity of DNA influences the distribution of variant allele frequencies and, con-sequently, the standard deviation of heterozygous variant allele frequenciesprovides a powerful and straightforward measure of sequencing quality. Char-acterizing the underlying distribution is attractive to evaluate the burden ofa somatic mutation and potential allelic imbalance in cancer. The knowledgepresented in the theoretical framework is based on the included papers andother recent work that will be cited accordingly.

The large number of data points generated in NGS projects inevitably leads to substantialfalse-positive observations if appropriate measures do not exclude erroneous variants.Unfortunately, there has been little consensus or discourse on assessing or reporting thequality of resulting sequencing reads and thus the validity of reported variants. While it isnot trivial to conclude by NGS alone whether a mutation is truly somatic, variant filteringand systematic annotation may alleviate some of the effects of mediocre or inferior quality.The first problem of finding true somatic variants is diffidently exemplified by the previousstudy involving monozygotic twins with monoclonal B cell lymphocytosis (Hansen et al.,2015b, 2018)*, and the latter quality issue is aptly illustrated by the sequencing of DNAderived from formalin-fixed paraffin-embedded tissue. On the variant level, coverageand genotype quality provides measures, which are not necessarily representative qualitymetrics of the sequenced sample on their own. That is, a sequencing run of high depthdoes not guarantee meaningful biological results. The peculiar transition/transversionratio (Ti/Tv) has previously been suggested as an overall indicator of sequencing quality,and so has the ratio of heterozygous to homozygous variants (Wang et al., 2015).

46

Several factors contribute to a decreased quality, such as a low cell or DNA input, DNAintegrity, the degree of amplification of the nucleic acids in the sample, and more indirecttechnical issues. On the other hand, some contributing factors are systematic and canbe alleviated or circumvented. For example, one pertinent issue is the batch effect (Leeket al., 2010), which is potentially detrimental. The phenomenon is, of course, not confinedto NGS assays, but also microarray (Irizarry et al., 2005) and general laboratory analyses.As follows, an effective quality measure is thus needed for direct comparison.

8.1 Heterozygous allele frequencies as SNR measure

In the human diploid genome, each pair of autosomes is derived from the maternal andpaternal genome. From a sequencing perspective, this means that the germline alleleratios can be represented as discrete values of 1, 1/2, or 0 if absent. Furthermore, becausethe biological signal is sustained despite purification, amplification, and sequencing, thedigital signal holds information on both the true biological signal and the inherent noiseof the laboratory techniques and methods. As such, sequenced heterozygous variantallele frequencies are distributed symmetrically around the biological signal, one-half, andthe standard deviation of heterozygous variants is thus suggested as an effective qualitymeasure in the signal-to-noise ratio of a given sample.

8.2 Effect of coverage or DNA input on resolution

In addition to input DNA amount or integrity, the total number of sequencing reads affectsthe resolution. Formally, the standard deviation and variance decrease as the number ofobservations increases (Eq. 8.1–8.2).

� =

r1

n

qX|x� µ|2 (8.1)

�2 / 1

n(8.2)

How the sequencing resolution is affected by coverage can be challenging to analyze sys-tematically due to sample-wise variation. However, in some instances, the MCL genomeis struck by extensive genomic aberrations, such as exemplified by Patient 1 in PaperII. Such cases may provide an informative and minimally biased view of how coverage

47

of a genomic region and quality are directly correlated. Although the patient was de-fined as 46,XY by clinical laboratory cytogenetics prior to relapse, the malignant cloneexpanded with profound alterations in chromosomal copy-number, counting 14 aberrantchromosomes resolved by WES. Several discrete copy-number states were demonstratedby deletions, trisomies (3N), tetrasomy (4N), and even pentasomy (5N). Hence, the pro-gression was highly suited for a more sophisticated evaluation of the allelic imbalance andrelative change in chromosomal coverage arising from such events. The results directlyunderlined the relationship between coverage and VAF standard deviation (Fig. 8.2).The analysis of allelic imbalance arising from CNA is discussed in next chapter.

The other primary factor affecting resolution, the amount and integrity of the DNA, haspreviously been demonstrated by sequencing of whole-genome amplified DNA from singleand few cell input (Simonsen et al., 2018)*. Briefly, the observations showed that with alower amount of DNA, the number of unfiltered called variants increased (Fig. 8.3A),and that the quality generally increases according to cell input (Fig. 8.3B and Fig. 8.4);with a lower amount of DNA the estimated fraction of actual variants decreases, the riskof allelic dropout increased, and the signal deteriorate. As the laboratory workflow andthe total number of reads were comparable in the study, the difference in variance wasdirectly related to input quality.

Generally, it was found that the collective contributions from DNA starting amount, in-tegrity, sample purity or quality, and coverage to the relative dispersion of heterozygousvariant allele frequencies could be demonstrated across different sample types (Fig. 8.1).

8.3 The variant allele frequency distributions

Statistical characterization of the sequencing VAF distribution is attractive to estimatethe range an observation confidently will be present in, e.g., that roughly two-thirds ofthe variants will lie within one standard deviation from the mean if normal distributed.It will also help to define the uncertainty when assessing the allelic burden of a somaticmutation. Several factors contribute to an increase in the variance, such as amplificationused in targeted sequencing leading to stochastic fluctuations depending on the numberof PCR rounds (Heinrich et al., 2012), and a high variance will increase the false-negativerate of variant calling.

In relation to this project, the characterization of the underlying distribution of het-erozygous variants was based on a compilation of 836 autosomes from 38 whole-exome

48

Figure 8.1: Dispersion of heterozygous variant allele frequencies according tosample preparation. From left to right: dilutions of AML cell line, where 1–5 cells weremanually extracted by micromanipulation (Simonsen et al., 2018)*, bone marrow (BM)coagulum and frozen tissue from mantle cell lymphoma (Paper IV), bone marrow lysatefrom an AML patient with normal karyotype (Hansen et al., 2016)* and fluorescence-activated cell sorting purified cells from a patient with mantle cell lymphoma (PaperII). The last two shown (*) are from the same sample, with the last boxplot having astringent read-depth threshold of 50 in comparison to 30 in the previous sample.

49

Figure 8.2: Direct correlation between the standard deviation and change incoverage induced by aneuploidy.

Figure 8.3: Quality as a function of cell or DNA input. An increase in the repre-sentative number of cells leads to a decrease in called variants (A). Also, the results ofthe study by Simonsen et al. (2018)* indicated that the heterozygous variant allele fre-quencies become normal distributed when cell input increases and the standard deviationdrops (B).

50

Figure 8.4: Signal deterioration of few input cells showing (A) marker dropout frommicrosatellite genotyping (STR) based on 21 different loci from, (B) allelic mismatchaccording to cell input, and (C) phred-scaled likelihood of correctly called variant zygosity.

sequenced samples, including samples of Paper II, thus representing a more broad rangeof hematologic malignancies (Hansen et al., 2019)* (preprint). The data set comprised48.5–188.6 million (median 84.3 · 106) mapped paired-end reads per sample and a me-dian of 19,101 coding variants, in agreement with the expected (Frebourg 2014; Ng et al.2009). The median read depth was 96, and lower and upper quartiles (Q) were 71 and136, respectively.

Whereas the mean chromosomal heterozygous allele frequencies were found to be con-stant (Q[1,2,3] = [0.478, 0.482, 0.486]), with only few outliers (n=7), the chromosomalVAF standard deviations (�, Q = [0.062, 0.068, 0.074]) contained a substantial numberof outliers (49, Fig. 8.5). The majority of these outliers were associated with allelicimbalance, arising from frank CNA.

Because WES data is marked by substantial stochastic noise and technical bias, thestatistical inference was performed using a Monte Carlo approach; by this approach, thesample variant allele frequencies were statistically evaluated against random data samplesgenerated from a Gaussian model (listing 8.1). The approximated normal distributedmodel was based on empirical standard deviations, mean of approx. 0.5, and a numberof observations corresponding to the specific chromosome being tested. In addition, thenumber of heterozygous outliers was defined by the Tukey method (Eq. 8.3). Generally,heterozygous variants were located in a specific range (0.1 < VAF < 0.8).

This Gaussian model was, in general terms, found to be in agreement with the empiricaldata (Fig. 8.6). In total, 96% of the autosomes were found to approximate a Gaussiandistribution (805/836 at 99.9% confidence level, Anderson-Darling test).

51

Figure 8.5: Evaluation of variant allele frequencies in a large cohort of auto-somes (Hansen et al., 2019)*. Whereas � (blue) of heterozygous allele frequencies(0.2 < V AF < 0.8) was comparable between chromosomes unaffected by copy-numberalteration of same sequencing batch, the introduction of a copy gain or low-frequency copyloss increased the �. The mean chromosomal heterozygous allele frequency (red, µAF )did not differ to any significant extent. The frequency symmetry in cases of trisomy leavesthe mean unaltered.

52

1 RandomVariants = RandomVariate[NormalDistribution[mean , SD , variants ];

2 RandomOutliers = RandomFrequency [0.1, 0.8, outliers ];

3 VAFmodel = RandomSample[Join[RandomVariants , RandomOutliers ];

4

5 AndersonDarlingTest[VAFcase , VAFmodel]

Listing 8.1: Robust statistical test for normality using a Monte Carlo method andAnderson-Darling test between case and model generated by random Gaussian variantfrequencies with stochastic outliers. The test is suited for both high and medium-qualitysequencing data.

Tfences = Q1� k(Q3�Q1), Q3 + k(Q3�Q1); k = 1.5 (8.3)

8.4 The special case of low-frequency variants

In contrast to heterozygous germline variants, somatic mutations are often found in thelower range of the allele frequency spectrum, where such variants are difficult to discernfrom technical errors. Typically, the number of erroneous variant calls is found to risesteeply below frequencies of 10% for high-quality exome sequencing with coverage ofapproximately 100x. Whereas the germline SNP algorithm HaplotypeCaller, and formerlyUnifiedGenotyper (GATK, Broad Institute), does not operate on low-frequency variants,the somatic caller MuTect2 does. Thus, this tool is frequently used for the detection ofmutations without a lower threshold.

It may be argued that the high error rate is restricted to WES because of the low coverage.Thus, the often very high depth of coverage in targeted panel sequencing, such as lym-phoid or myeloid panels, may provide additional information on the risk and distributionof false-positive somatic variants. In line with this, the study by Sørensen et al. (2020)*,which comprised 36 patients diagnosed with therapy-related myeloid neoplasms, includ-ing myeloproliferative neoplasms, myelodysplastic syndrome, and AML, investigated thelower VAF threshold from a 30 gene panel. While the depth of coverage varied (Fig.8.7A), the variant allele frequencies were to a large extent governed by the biologicaldistribution of the alleles, being prominent hetero- or homozygous. However, a trimodaldistribution was observed due to the substantial number of variants in the frequencyrange, here below 5% (Fig. 8.7B and C). Zooming in, the number of variants in thelower range matched an exponential rise, symptomatic of a steeply increasing number of

53

Figure 8.6: Empirical data and model comparison. The model of heterozygousvariant allele frequencies listed in 8.1 provides a reasonable description of chromosome-wise data from whole-exome sequencing, evident from a scatterplot (A), boxplot (B), orthe kernel density estimation (C). The empirical data is derived from an AML patientwith a normal karyotype described in Hansen et al. (2016)*.

54

Figure 8.7: Deep panel sequencing of myeloid neoplasms described by Sørensenet al. (2020)* (A). The trimodal histogram of variant allele frequencies (B) is char-acteristic of low-frequency noise (or variants), heterozygous and homozygous variants.The relative percentage of variants rises steeply when reaching 2% (C). (D) Shown on alogarithmic scale.

55

Figure 8.8: Distribution of unfiltered somatic variants allele frequencies inwhole-exome sequencing. Data is derived from samples presented in Paper II. Therise in variants below 10% shows that decent coverage does not necessarily guarantee abiological variant.

false positives (Fig. 8.7D). It is assumed that this phenomenon does not reflect thebiological signal as evidence points to a background noise level that is generally inherentto the method in use. In the same manner, the error rate as a function of VAF can becharacterized for WES (Fig. 8.8). While the general threshold of 10% in WES is cor-roborated by findings in other studies (Shi et al., 2018; Chen et al., 2020), the practicallimit may be as high as 12.5-15% when the resolution is suboptimal. Consequently, theconfident calling of variants with a low allelic burden is challenging, but has traditionallybeen backed by qPCR or Sanger Sequencing (Cleveland et al., 2018).

From the empirical analyses, I find that the main factors for mitigating false-positive vari-ants are sample purity, such as obtained from sorting, cell or DNA concentration/amount,and the total number of mapped reads. Also, post-processing plays an important role, no-tably discarding non-paired-end observations, providing a coverage threshold (e.g. 30x),algorithmic confinement to targeted regions during alignment, variant calling, or after aslisted in 8.2, and filtering on the basis of annotation as suggested previously (Hansenet al., 2015a)*. A completely different approach to mitigating error in order to find low-frequency mutations is based on the ingenious addition of unique molecular identifiers toeach DNA molecule. Although I will not detail how to exploit this in WES, it is alreadyfinding wide usage through single-cell sequencing.

56

1

2 SelIntervals = Function [{chr , vcf , bed},

3 VCFsubset = Select[vcf , #[[1]] == chr &];

4 BEDsubset = Part[Select[bed , #[[1]] == chr &], All , 2 ;; 3];

5 BEDsubset = Apply[Interval , BEDsubset ];

6 Select[VCFsubset , IntervalMemberQ[BEDsubset , #[[2]]] == True &]

7 ];

Listing 8.2: Restriction of variants to targeted regions obtained from sequencing provideror sequencing kit manufacturer is a simple, but crucial step in the analyses of capture-based sequencing output. The code is written in the Wolfram Language.

57

9 Allelic imbalance in the context

of chromosomal copy alterations

Chromosomal aberrations are common features of hematologic malignancies.Several recurrent aberrations, such as chromosome 17p deletions, are rec-ognized as important diagnostic or prognostic markers. While genome-widegenotyping and genomic hybridization microarray analyses have been imple-mented in clinical laboratories for several years, the usage of NGS for thedetection of acquired CNA has not yet reached full clinical integration. NGScombines two previously biological assays, the SNP microarray and compar-ative genomic hybridization array (arrayCGH), into a single assay.

The bioinformatics required to call CNV from NGS are generally divided into two branchesanalogous to the genome-wide SNP arrays: 1) analysis of imbalance in variant allelefrequencies and 2) changes in read-depth ratios. This chapter will deal with the first,whereas the latter is the subject of the next chapter.

9.1 Mathematical considerations

Despite the wide range of scientific papers published the last decade involving NGS,almost none addresses the underlying distribution of the sequencing data in detail. Iadvocate that such metrics generally must be performed and included for all sequencingprojects. The insights gained from such scrutiny may help alleviate erroneous conclusions– especially when NGS is taken to its extreme. Examples of error-prone usage are se-quencing of PCR amplified low input material or of poor quality, e.g., single-cell assaysor formalin-fixed paraffin-embedded tissue. In the following section, I will elaborate on

58

the topic introduced in the previous chapter through logical deductions relating to VAFdistributions.

In the previous chapter, it was confirmed that the allele frequencies of heterozygousvariants, depending on the number of sequenced reads and general quality, do assume anormal distribution in spite of increased bias and non-uniformity of reads compared toWGS (Belkadi et al., 2015; Meynert et al., 2014). A practical usage of this observationis the estimation of low frequency in copy-number gains through mathematical analyses.Using imbalance arising from a simple trisomy will provide an example in the followingdeduction, from which burden can be estimated from WES even when VAF distributionis not discernibly bimodal.

If the chromosomal heterozygous VAF distribution is Gaussian (Eq. 9.1), then theinflection point is located at ±�. It follows that for a trisomy to be discernible as abimodal VAF distribution, Gaussian in each of the separate unimodal components (Eq.9.2), the modes containing reference and alternate alleles must be separated more than2�. A trisomy present in all cells will present a distinct unimodal distribution with modesat 1/3 and 2/3, which can be generalized for all copy-states (n, Eq. 9.3).

Because the unimodal distributions describing a copy-gain or low burden deletion aresymmetric, the outer inflection points can be used in assessment of the burden. Thesepoints, xmax and xmin, are readily discernible from the derivative function (Eq. 9.4)with xmax and xmin computationally found at max(f(x)0) and min(f(x)0) (Fig. 9.3).Now, the estimated burden of the allelic imbalance can be calculated (Eq. 9.5), andpotentially be correlated with the copy-number profile assessed from read-depth ratios(described in next chapter).

f(x) =1p2⇡

e�12 (

x�(µ)� )2 (9.1)

f(x) =1p2⇡

e�12 (

x�(µ��)� )2 +

1p2⇡

e�12 (

x+(µ��)� )2 (9.2)

N =

"n�1

1� n�1

#(9.3)

59

Figure 9.1: Theoretical bimodal distribution as described by Eq. 9.2. When themodes of two individual normal distributions (stippled lines) are separated less than 2� itis not possible to discern whether the underlying distribution is unimodal or composite.

f(x)0 = � (e�(���µ+x)2

2�2 (��� µ+ x))p2⇡�2

� (e(��µ+x)2

2�2 (�� µ+ x))p2⇡�2

(9.4)

CNA%

100⇡

xmax � � � 12

12 � 1

n

⇡12 � xmin � �

12 � 1

n

, xmax > � +1

2, xmin <

1

2� � (9.5)

9.2 The effect of sequencing read depth

As was shown in the last chapter, the variance of heterozygous variant allele frequenciesis directly proportional to the number of reads. To recapitulate, because the depth isnot the only contributor to the signal-to-noise ratio, it is not trivial to demonstratethis effect between samples. The mantle cell lymphoma relapse of Paper II (Patient1) provided such an opportunity, with a pronounced aneuploid clone ranging from 1–5chromosomal copies in distinct regions of the coding genome. Thus, this case providesa direct correlation between VAF and copy-number, i.e., allelic imbalance, the standarddeviation as a function of reads, and finally, an immediate increase in coverage as afunction of copy number. Because the standard deviation, and thus the resolution, isinversely proportional to the square root of the sample size, i.e., median sequencing reads

60

Figure 9.2: Theoretical bimodal distribution of heterozygous variant allele fre-quencies characteristic of a single copy-gain at 100% burden or copy-loss at33,3%.

Figure 9.3: Derivative variant allele frequency distributions. The three functionsshown are based on Equation 9.4 with parameters µ = 0.5, � = 0.08, � = 0 (blue,unimodal), � = � = 0.08 (orange, unimodal, e.g. trisomy at approximately 50% burden),and � = 1/6 (orange, bimodal, e.g. trisomy at 100% burden or deletion at 33,3%)

61

Figure 9.4: Correlation of variant allele frequency and copy number (A) in relation tostandard deviation as a function of coverage (B).

nmed, it is not surprising that the non-linear regression of empirical data directly reflectsthis relationship (Fig. 9.4). As the range of VAF values is 0–1, then a completelystochastic outcome would be characterized as �max ⇡ 1. However, because technical biasis inevitable, then �0 6= 0 (textbfEq. 9.6).

�(nmed) = �max1

pnmed

+ �0 ⇡ 1pnmed

+ �0 (9.6)

9.3 Allelic imbalances and sample quality

It is now clear that detecting allelic imbalance from WES depends on sufficient qualityand reads. When only a few cells or a limited quantity of material is being investigated,extensive amplification is often required to attain enough material for sequencing. Tworecent collaborations exemplify this, one involving characterization of the signal-to-noiseratio in single-cell and sparse-cell sequencing referred to previously (Simonsen et al.,2018)* and the other involved copy-number profiling of sorted cell subsets with less than1000 cells (Simonsen et al., 2020)*. In both studies, signal processing was implementedto mitigate background noise, partly based on a median filter. In the first study, thediscernible allelic imbalances of the AML cell line (Fig. 9.5), OCI-AML3, were found tobe in general agreement with the 24-color karyogram (Fig. 9.6).

62

Figure 9.5: Allelic imbalance introduced by copy-number alterations in singleand few cells. Exome sequenced cells derived from OCI-AML3 cell line were comparedagainst the 24-color karyotype and previously published cytogenetic profile. A markeddeterioration was observed as the cell input decreased. The noise-filtered variant allelefrequencies were defined as AFmed = MedianFilter[µ+Abs[AF � µ], 100]

63

Figure 9.6: 24-color karyogram of the OCI-AML3 cell line.

More Information on methods and statistical tests implemented for the detection allelicimbalances is provided in Paper III, chapter 11 and preprint paper (Hansen et al.,2019)*.

64

10 Detection of chromosomal copy

number alterations by read depth

ratios

The assembly of a human genome is based on a contiguous overlap of read sequences.Ideally, sequence fragments will be randomly and independently distributed across thewidth of the genome (G). In such cases, it is relatively trivial to estimate the averagedepth of coverage (C) from the number of fragments (N) of specified length (L).

C =LN

G(10.1)

This problem was approached by Eric Lander and Michael Waterman (Lander and Wa-terman, 1988), who did not focus on NGS workflows, as the technology was not inventedyet, but on gaps in DNA cloning. However, Eq. 10.1 does not provide information onhow many sequencing fragments are required to obtain a contiguous string of bases thatrepresent the human genome end-to-end. Reads may, of course, overlap and this makesit possible to generate a more or less complete map of the individual chromosomes. Thestochastic process of how sequences are independently scattered across a region can be as-sumed to follow a Poisson distribution so that the probability of any base to be sequencedto a specified depth (d), considering the known average coverage (C, Eq. 10.2).

P (d, C) =Cd ⇥ e�C

d!(10.2)

65

Thus, the theoretical model predicts that if a whole genome is sequenced to a mean cover-age of 4x, the bases are covered four times in only 20% of the hypothesized observations.It also hints that there is a 98% chance that any given genome position is covered atleast once. The probability of any given position to be covered can be generalized as Eq.10.3 by the total number of sequencing reads (N), as we know the read length, or atleast the median length, and the size of the human genome is roughly three billion basepairs.

P (N) = 1� e�LNG (10.3)

Because some regions contain repeats and are difficult to sequence, such as centromeric re-gions, this mathematical description is an oversimplification. However, such models are of-ten highly practical when investigating the genome and developing new algorithms.

Assessment of change in regional coverage compared to a suitable single control sample,or a control generated from multiple samples, is one of the most common approachesto detect chromosomal CNA. While sequencing reads of WGS are generally assumed tobe distributed randomly across the genome, with each read pair being an independentobservation, this does not apply to capture-based or amplicon sequencing. In WES, a bell-shaped distribution is often observed, where the flanking nucleotides of a targeted regiondisplay a more shallow coverage than is found in the center of the exon. However, it isassumed that the distributions of normalized sequencing reads between two samples, oftencase-control paired, are comparable when the same target enrichment is used. In WGS,the regions are often defined as bins in the range of kilo- or megabases, while it is morepractical to consider targeted regions, such as exons, when using capture by hybridizationor PCR amplicons. Because of the relatively small size of the exome, constituting justover 1% of the entire genome, the majority of flanks or breakpoints of structural variantswill lie outside of the coding regions (Kadalayil et al., 2015). The approach described inPaper III is to map and count reads to gene intervals to assess the ratio of a specific geneand, subsequent, to perform a Wilcoxon rank-sum test on groups of n successive genes(e.g. n = 50) (Fig. 10.1). It was found that the statistical tests are heavily affected bysequencing batch effects.

66

Figure 10.1: Partial copy-number gain of chromosome 3 comprising approxi-mately 140 Mb. The gain was highly significant using U test and �2 test. It is also seenthat, even though the sizes of the p and q-arms are of comparable, the proximal regionsof chromosome 3 are particular gene-rich compared to the more distal copy-afflicted part.Thus, the relative size is not necessarily meaningful without genomic coordinates.

67

10.1 Identifying batch effects and poor quality

Small differences in laboratory workflow may influence the end-results and ultimatelylead to faulty conclusions. The most extreme factors are running on different instrumentsand with different kits in the wet lab, but more intricate are perhaps the batch effects,which have no known sources other than different sequencing runs and day-to-day varia-tion. While some attempts have been made previously to alleviate batch variations, suchas using a principal component analysis (Shi and Majewski, 2013), it is often detrimen-tal.

Whether a sample-control pair is compatible for detecting CNA can be assessed by directlycomparing the number of reads each region has attained. Counting the number of readsfor each gene can be performed with, e.g., BEDTools (Quinlan and Hall, 2010), whenproviding the sequence alignment and gene intervals (command line):

1 bedtools multicov -bams tumor.bam -bed gene -regions.bed > tumor -

genecoverages.bed

The resulting data points are then visualized gene-wise in a regular scatter plot withno need for normalization (shown in Wolfram language, Wolfram Research, Champaign,Illinois, USA).

1 result = DeleteDuplicates[

2 Transpose [{

3 Part[Import ["tumor -genecoverages.bed", "TSV"], All , -1],

4 Part[Import ["control -genecoverages.bed", "TSV"], All , -1]

5 }]

6 ]

7

8 ListPlot[result]

The resulting correlation scatter plot and Spearman correlation coefficient are highlypractical for the collective assessment of DNA and sequencing quality and to assesswhether the paired samples are suitable for the detection of CNAs employing read-depth ratios. Also, the scatterplot will reveal large copy-number changes (Fig. 10.3).Previous work (Simonsen et al., 2020)* suggests that even WES reads distributionsfrom small PCR amplified cell subsets, consisting of approximately 50–500 purified cells(CD45lowCD34+CD38�CLEC12A+/�) from patients with a myelodysplastic syndrome,can be highly concordant and reproducible (Fig. 10.2), when batch-effects are avoided.

68

Figure 10.2: Demonstration of batch effect in three control-paired whole-exomesequenced AML samples. Regional coverage correlation dismiss batch-effects in thefirst two samples, but is unmistaken in the third. The data were derived from the studyby Simonsen et al. (2020)*.

Figure 10.3: Scatterplot of reads between paired B cells (7AAD-CD3-CD19+light chain+CD5+CD20+FMC7+) from mantle cell relapse patientsample and paired T cells (7AAD-CD3+). The correlated gene-wise sequencingreads immediately reveal five different copy states, high sample qualities and compatibil-ity. Data were derived from Paper II.

69

Figure 10.4: Chromosome-wise view of batch-effects by read-depth ratios. Thetwo upper panels show cases of control-paired MCL samples derived from Paper IV,whereas in the lower is a high-quality example the case-control samples were sequencedtogether.

It is important to emphasize that while a single gain or a loss both imply a 50% change inthe number of copies, the relative change from disomic state to a trisomy is considerablylower, reflected in the signal. Thus, the SNR is generally higher for deletions than fortrisomies of equal burden. The effect on allelic imbalance is even more profound, ascovered in chapter 9. Consider the ploidies n�1, from one to two copies versus two tothree copies (Eq. 10.4).

1�1

2�1>

2�1

3�1(10.4)

The investigated material from patient cases of CNS relapsing mantle cell lymphoma wasgenerally suboptimal (Fig. 10.4) for direct comparison of CNA as described. Thus, amethod was devised, which was more robust towards inconsistent quality and batch ef-

70

fects. Briefly, the strategy implements an intra-sample control region, which in this studyinvolved the relative comparison of the q-arm of chromosome 1 to that of the p-arm.Because the expected relative coverage is not known from the potentially aberrant chro-mosome alone, the ratio is normalized to that of a paired control sample (Radjusted), whereN denotes the total number of sequencing reads positioned in the specified regions.

Radjusted =Nq,case ·Np,normal

Nq,normal ·Np,case(10.5)

10.2 Genome-wide CNA detection

Generally, it is asserted that WGS is more powerful to detect genomic aberrations whencompared with regions covered by WES (see Paper I). It is assumed that the latter isphased out to cover the whole genome in a more unbiased and, in time, more practicalfashion. This advancement is especially relevant for cytogenetics. Thus, in parts of mywork, I have attempted to extend the approaches developed for WES to WGS (unpub-lished, see Appendix). Deduced from mentioned Poisson distribution and non-parametriccomparison differences between paired samples, it is theoretically possible to detect a tri-somy with a burden of 50% with approximately 95% certainty (p<0.05, Mann–WhitneyU test / Rank sum test) from individual regions of 1000 bases (Fig. 10.5). However,empirically, reads may cluster together in smaller regions, e.g., 500–1000 bases. Conse-quently the analyses must typically include intervals of 10.000 bases or more. The optimalinterval size may be found by sliding through a single sample in fixed window sizes of e.g.,1–10 kb performing statistical tests between two co-sequenced paired control samples. Forthe initial testing rounds, I will recommend 100 paired observations of 10 kb each to matchthe 1 Mb resolution of arrayCGH. The regional reads must be normalized to the totalnumber of sequencing reads, as exemplified by the R script in the Appendix. This scriptextends some of the methods presented in this thesis relating WES and optimizations forthe analysis of many whole-genome samples in parallel.

71

Figure 10.5: Statistical power to resolve whole-genome copy-number alterationsusing Monte Carlo simulations. The figure shows that even with very low coverageof 0.5–1x and regions of 1000 bases each, the theoretical power to discern a copy-numberloss or gain is high. The test does not assume the sample-control pairs to be matchedsamples as reads are randomly distributed and unrelated

.

72

11 Integrating analysis of copy-neutral

losses by the juxtaposition of

variant allele frequencies and

read depth ratios

Detection of net increase or decrease of chromosomal material constitutes adiagnostic pillar of laboratory hematology. Often, the cytogenetic profile isused in the prognostic staging of the patient harboring a specific hematologicmalignancy. Although cytogenetics has been an inseparable part of routinediagnostics for a long time, the motivation of the research presented in thischapter is supported by the fact that acquired CNA do not always contributeto effective aneuploidy. CN-LoH is a common genomic aberration in hemato-logic malignancies (Arenillas et al., 2013) but poses some problems concerningdetection. While it is true that SNP microarrays can be used to detect CN-LoH (Arenillas et al., 2013; Gondek et al., 2007), it cannot be identified as suchwithout accompanying arrayCGH. Supersession by NGS has the advantage ofanalogously combining these arrays. This combination was also developed forarrays (Wiszniewska et al., 2014) at a time when NGS was rapidly enteringthe biomedical research laboratories. Although these genome-wide chromoso-mal microarrays have the potential to replace or supplement other techniquesin the cytogenetic laboratory, as described by Berry et al. (2019), it will in-evitably give way for NGS approaches. One of the limitations of the previoususage of SNP arrays is a general bias towards common polymorphisms (Yu

73

Figure 11.1: Three leukemia patient cases harboring copy-neutral deletions on chromo-some 8 and 14 (A), chromosome 9p (B), and chromosome 13 (C) from (Hansen et al.,2019).

et al., 2018), while most driver mutations in hematologic cancers are rare inthe general population. Collectively, NGS has the potential to encompassdetection of balanced and unbalanced translocations in addition to somaticmutations, SNPs, chromosomal copy-number alterations or variation, geneduplications or deletions, and CN-LoH in a single assay.

The motivation behind Paper III and preprint paper (Hansen et al., 2019)* was toextend the general detection of CNAs by exome sequencing to the subtle distinction ofcopy-neutral chromosomal losses. The relevance is justified because current sequencingread depth correlations, of which many tools exist, do not detect this type of deletions.A few tools are currently available, which combine VAF and coverage analyses, but theapplicability generally lacks transparency or is no longer supported, such as ExomeCNV(Sathirapongsasuti et al., 2011)(last update 2012, archived 2013). Whereas the algorithmcan be approached by different means, Paper III formalizes the method by a simplestrategy: On the notion that the human genome contains approximately 20,000 protein-coding genes and roughly an equal number of coding variants, the VAF and gene coverageeach provide discrete datapoints for direct comparison of the chromosomal imbalances.One strength of the method is its reproducibility which can be explained in a few lines,relying on unpaired Mann-Witney U test for comparing depth of coverage and �2 or

74

SamplesCase Normal

Allele A DC,A DN,A

B DC,B DN,B

Table 11.1: 2⇥ 2 contingency table of case-control paired allele depths (Dsample,allele)

Fisher’s exact test for detecting allelic imbalance. Because CNAs often affect multiplevariants/genes, compound probabilities can attain infinitesimal theoretical values, as for-malized in the following approach (see Paper III for more information). In situationswhere stretches of probable and significant allelic imbalance are found, without significantchange in coverage, there is a strong indication of CN-LoH.

Let Dsample,allele define the case-control paired allele depths (Table 11.1). Then the �2

statistics (Eq. 11.1) and probability (Eq. 11.2) for each locus collectively provides thetheoretical compound probability (Eq. 11.3) for the given outcome to occur by chance.Because the number of simultaneous tests is large, given by the number of variants onthe chromosome being analyzed (n ⇡ 1000), an adjusted significance level on the variantlevel is recommended, such as Bonferroni correction (Eq. 11.4).

�2Pearson =

N(DC,ADN,B �DN,ADC,B)2

(DC,A +DC,B)(DN,A +DN,B)(DC,A +DN,A)(DC,B +DN,B)(11.1)

p = 1� CDF (ChiSquareDistribution(df = 1),�2Pearson) (11.2)

pcompound =nY

i=1

p(i) (11.3)

p ↵/n (11.4)

The research on CN-LoH was spurred by previous work, where an immature T-cell acutelymphoblastic leukemia case (Hansen et al., 2018)* harboring a mutation in CDKN2A wasaccompanied by chromosome a 9p deletion (Fig. 11.1B) – none of which were resolvedby the diagnostic procedures.

75

Of the previously described copy-number test cohort (see chapter 8), six of fourteenpatients with hematologic malignancies were found to harbor seven copy-neutral deletions(Hansen et al., 2019)*. One of these patients was otherwise characterized as CN-AML(Fig. 11.1C) (Hansen et al., 2016)*. In cases of very low burden allelic imbalances, itis evaluated that the Brown and Forsythe robust test (1974) of equal variances may beimplemented for screening using paired control samples as exemplified by the reevaluationof a previous study involving monozygotic twins with monoclonal B cell lymphocytosis(Hansen et al., 2015b)* – also with a low-frequency deletion of chromosome 13. Becausethe 13q deletion is regarded as the most common cytogenetic aberration in CLL (Van Dykeet al., 2010) and may imply prognosis, it is important to detect CN-LoH, which effectivelyis comparable to a frank deletion. In the discussed case, where both twins were affectedby lymphocytosis, it may have been implemented as a molecular handle for follow-up inaddition to B-cell count and immunoglobulin light chain usage.

Regardless of the method in use, there is a direct correlation between the ability to detectan aberration as being significant, assay resolution, and the fraction of affected cells.Thus, if DNA library and sequencing quality are disregarded, the assessment of allelicimbalances depends directly on coverage and the allelic fraction (Fig. 11.3). This impliesthat duplications are more challenging to resolve than deletion or loss-of-heterozygosity. Inorder to detect a copy-gain confidently, it must be present at a high burden with coverageand low variance. Using the theoretical foundation of Fig. 11.3, a resolution of 100x witha VAF standard deviation of 0.08 or less and clonal frequency of 90–100% may be sufficientto accommodate a significance level of 0.01–0.05. Also, the genomic size of the allelicimbalance is crucial as it directly influences sensitivity. The methodology and output ofsorted malignant cells included in Paper II serve as an illustrative example, where thegeneral resolution was high and requirements met. The chromosome 22 profile of Patient1 MCL relapse sample (Fig. 11.2C) briefly summarizes some of the subtle problems whenperforming sequencing – or array analyses for that matter. The results support a clonalgenotype of AABB extending 4–5 million bases proximal to the centromere (22q11.21)and AA (or BB) in the remaining distal part. The implication is that the first will notbe identified as an allelic imbalance but imply a gain of estimated four copies (see Fig.9.4). Copy gains in this region are not frequently recurrent in hematology (Chang et al.,2011) but are associated with congenital syndromes with comorbid hematologic disorders(Vaisvilas et al., 2018; Chang et al., 2011). In contrast, the latter part strongly supportsa CN-LoH, with no net decrease of chromosomal material. Chromosome 22 deletions orloss-of-heterozygosity are frequently reported in a wide range of cancers, partly due to

76

Figure 11.2: A copy-number profile of chromosome 22 from a patient with relapsingmantle cell lymphoma. The control-paired gene-wise read depth ratios on variant allelefrequencies strongly imply a significant copy gain without allelic imbalance in the firstpart of the chromosome and a highly significant copy-neutral loss in the latter part, wheretwo copies are still present.

the firmly establish the tumor-suppressing role of the NF2 gene (Neurofibromin 2). Ofnote, chromosome 22 is acrocentric with a p-arm devoid of coding genes.

Now, as NGS is being utilized widely in laboratory diagnostics, it is of utmost importanceto gain an accurate detection of CN-LoH, adding substantial value to current cytogenetics.From our results, we suggest that pre-sequencing, such as correct sampling, cell sorting,and sequencing parameters, most notably coverage and batch effects, are as critical stepsas the different algorithms available. We also observed that heterozygous VAF standarddeviations are influenced by the total number of batch reads and that CNA has a profoundeffect on the AF distribution itself: Large deviations in VAF, also utilized by SNP arrays,are often visible by frequency plots alone. This allelic imbalance is readily comprehensibleas the observed distribution is directly influenced by copy gain and copy loss. Assessmentof coverage enables more sensitive detection, and parallel analyses of both offer stringentvalidation of CNAs and resolution of partial CN-LoH or complete isodisomies.

77

Figure 11.3: Theoretical assessment of statistical significance based on coverageand relative percentage of allelic change using the �2 test on 2x2 contingency tables

A close agreement between CNAs detected by means allele frequency changes resolvedby �2 statistics and change in DP ratio exists, as shown here. Also, we observed thatstatistical comparison of heterozygous VAF standard deviations is a potentially powerfulsupplement for the detection of low-burden chromosomal allelic imbalance. While specificcopy number changes can be suspected from sample allele frequencies without a pairedcontrol, this is generally not feasible by standard coverage analysis for WES. Additionalcaveats include batch variability, which affects variant allele frequencies to a lesser degree.Thus, the derivation of chromosomal copy alterations from VAF is a powerful tool on itsown. Although the most frequently utilized methods rely on DP ratios, the usage of VAFhas been described several years ago (Wang et al., 2015).

Common for the approaches presented is the theoretical ability to evaluate the tumorburden of the copy change, but it is often challenging. One important notice is that theallelic imbalance suffer from allele frequency overlap, as a function of both variance andburden, and potential subclonal involvement.

The previous chapter on the allelic imbalance of VAF covered how each read of a givenvariant and the reference allele could be viewed as sample observations, drawn randomlyand disjoint, to calculate �2 statistics of independence. The motivation is to test whether

78

an observed VAF is likely to have changed relative to the expected (i.e., V AFexp =12 6= V AFobs). It is trivial to see how the detection of VAF imbalance inflicted by achromosomal copy alteration, although recommended, can bypass the need for a matchedcontrol in certain applications.

79

12 Discussion

The presented research has two parallel and central themes in hematology;One line investigates the current and past usage of targeted sequencing andtechnical problems hereof, and the other explores biological aspects of mantlecell lymphoma as a case study. The primary effort of the overall projectwas to optimize NGS bioinformatics workflows by an empirical data-drivenapproach, in contrast to the purely algorithm-driven processing of data.

In contemporary biomedical research, bioinformatics has become an integrated part,driven by the large amounts of data generated even in modest studies. As the last in-cluded manuscript demonstrates (Paper VI), it is becoming possible to carry out efficientand fast sequencing data processing on a powerful local workstation, even outperforminghigh-performance computing clusters in some aspects, as shown by the speed comparison.Variant-calling with newer algorithms that can exploit modern machine architecture, herethe GATK4 Spark-enabled command-line tools, was found to be several times faster thanthe previous versions. The main reasons are faster hardware and software optimization.The initial analyses of the included MCL cases were contemporary with the release of defacto standard Genome Analyses ToolKit 4 (GATK4). However, the mutation detectionin the first releases of Mutect2 in GATK4, which still lacks proper parallelization, couldtake more than a day for one tumor-control paired sample at the same time.

In an attempt to solve this issue, scattering sequencing data into smaller regions, followedby recombination, mediated faster processing and is suggested by the Broad Institute(MIT and Harvard, Cambridge, MA, USA) as a simple parallelization approach. Thisdivide-and-conquer strategy was also implemented for the copy-number workflow usedin Paper II–III where alignment and subsequent parallel chromosome-wise read depthanalyses could be reduced to approximately one hour on a 24 logical-cores CPU. Similarly,

80

the entire workflow from raw sequences to variant-calling could be finished within an hourusing Spark, which is slowly being implemented in GATK4. The introduction of fast solid-state disks (PCI express/NVMe) and cheap computer memory in the last couple of yearshave made it possible to carry out whole-exome and genome bioinformatics for cancerresearch on midrange workstations. Since 2018, where the benchmark was performed, ithas now become possible to implement 128 logical cores/threads (64 cores) in an ordinaryworkstation. These and other advances help make NGS available to researchers who donot yet have the proper training or high-performance computing clusters within reach.A commercial bioinformatics solution, such as CLC Biomedical Genomics Workbench(Qiagen), may provide a practical alternative to the specialized Linux command-line toolsused in parallel in this project. The major downside to this is poor or costly scalabilityand lack of transparency. However, I found that running two different bioinformaticspipelines for variant and copy-number calling in several instances provided additionalinformation to confirm or reject observations.

12.1 WES flexibility and limitations

One advantage of the sequencing modalities is the potential to combine several previouslaboratory analyses into a single assay, particularly mutational screening and cytogenicinvestigations to the breadth of the genome. Other usages include profiling of gene ex-pression levels and clonal characterization – from bulk analyses to single-cell sequencing.Thus, it is an integrated approach that can become a powerful tool in combination withproper annotation from external biomedical databases.

Whole-exome sequencing is positioned between WGS and targeted panel sequencing. Itincorporates the breadth of the first method covering the coding genome and the lat-ter’s focus to reach a higher depth of coverage at a lower cost than WGS. The commonstrategy of capture-based targeted sequencing makes WES technically more comparableto panel sequencing, posing different challenges to WGS. All NGS modalities and librarypreparations have specific error rates, which can be characterized systematically as brieflydemonstrated.

The presented data shows that the false-positive rate steeply rises as the VAF drops,becoming discouraging below 10% for WES. Thus, even with targeted panel sequenc-ing, using a depth of coverage in the order of thousands, it is impossible to eliminatefalse-positive variants. Of note, whether subclonal mutations of very low burden are

81

detected from sequences may be regarded as a stochastic process, as modeled by thePoisson probability distribution. This problem is analogous to the detection of MRDemploying immunoglobulin gene rearrangements in B lymphocytes. Paper V presents aconceptual and mathematical critique of the sequencing of residual disease from clonallyrearranged immunoglobulins and is elaborated upon in the results section. Not surpris-ingly, the amount of DNA and coverage must support the intended sensitivity. Despitethis, it appears that this may be an overlooked issue in the current literature, and thusthis particular topic is relevant for the detection of low-burden somatic alterations, ingeneral. If performed correctly, the high specificity of rearranged DNA or larger indels incombination with the procedure’s sensitivity becomes a powerful tool for clinical MRDapplications.

12.2 Comments on variance

Because neither the number of reads nor DNA quality alone is an effective proxy for thesequencing resolution, I propose that the standard deviation of heterozygous variant allelefrequencies provides such a measure: It is an effective indicator of sample purity, quality,integrity, and the concentration or amount of nucleic acids – all-important attributes ofthe final resolution.

One of the efforts to augment the resolution here was implementing fluorescence-activatedcell sorting of antibody-labeled mononuclear cells in Paper II. Although a simple con-cept, often highly suitable for hematologic cancers where the immunophenotype is oftenknown, it has not received much attention in the literature on improving sequencing qual-ity. Relative to bulk sequencing or whole genome amplification of only a few cells, thestandard deviation of heterozygous variant allele frequencies of the sorted MCL cells ofPaper II was markedly decreased. Previously, in parts of my previous research, it was in-dicated that purified T cells based on the T lymphocyte co-receptor, CD3 (Hansen et al.,2015b)*, could leverage the quality for control samples, whereas the sequencing librarypreparation of the case samples was in fact performed on bone marrow mononuclear cells.Furthermore, the direct correlation between the amount of input DNA and quality haspreviously been demonstrated in a study characterizing the sequencing signal-to-noise ra-tio in replicates of one or more micromanipulated cells to approximately 50 cells derivedfrom a leukemia cell line (Simonsen et al., 2018)*. Quite recently, this usage of the stan-dard deviation to estimate the signal-to-noise ratio, SNR = µ

� , has been adopted in theanalysis of vector integration in patients’ hematopoietic cells (Cesana et al., 2021).

82

Despite not being a consensus gauge for quality in NGS studies, the dispersion of het-erozygous variant allele frequencies is directly compatible with the statistical measurecoefficient of variation, i.e., the relative standard deviation, (CV = �

µ ), or the frequentlyapplied SNR in engineering (SNR = CV �1). Whereas the mean of heterozygous variantsis constant even for chromosomes afflicted with trisomy, a sudden rise in � when slidingalong the genome is a strong indicator of allelic imbalance. As deduced in the theoreti-cal framework, the burden of low allelic imbalance can be estimated from the derivativefunction of the VAF kernel density estimation (KDE) and � of the approximated normaldistribution. To my knowledge, this is the first formal model of the problem address-ing the lower frequency range. In cases where a large proportion of the bulk containscells harboring allelic imbalance, the burden can be directly estimated by the modes ofthe KDE. As I have shown, high burden CNA also affects the number of reads for theimplicated region and thus the variance.

Effective removal of false-positive mutations is one of the most important and ongoingissues in NGS research. Whole-genome, exome, and panel sequencing inevitably give riseto a substantial amount of false positives. An error rate above 0.1% has been estimatedpreviously (Ma et al., 2019; Glenn, 2011; Goodwin et al., 2016; Mardis, 2013). This fig-ure was based on previous Illumina sequencers, such as Illumina Genome Analyzer II,HiSeq1000, and HiSeq2000, but it is still a persistent issue. Consequently, a genome se-quenced with a shallow depth of coverage of four reads (4x) may contain several millionerroneous base calls. This figure is in the same order of magnitude as expected single nu-cleotide variants, and it will be difficult to ascertain whether any given variant is germline,somatic, or merely introduced by technical bias. In the first decade following the introduc-tion of sequencing by synthesis, it was common to confirm somatic candidates by Sangersequencing (Cleveland et al., 2018), although with a lower sensitivity. Other strategies tocurtail erroneous bases include sequencing replicates or increasing the depth of coverage.Because NGS assays are flexible, it is attractive to increase the coverage (Meynert et al.,2013); however, it is not possible to mitigate errors arising from amplification of the nu-cleic acids. Although it is suggested that sequencing replicates can play an increasing rolein noise reduction, there is nothing in current NGS usage that has indicated that this willbecome the standard (Robasky et al., 2014), despite the continuously falling sequencingcosts.

83

12.3 The role of replicates

It has been shown that sequencing of biological or technical replicates increases the pre-cision of the assay more effectively than increasing coverage (Liu et al., 2014; Ziller et al.,2015). Several different strategies exist, such as technical and biological replicates or, asemphasized here, the combination of the coding genome and transcriptome. However,while replicates are frequently implemented in qPCR assays, there is no such consensusfor NGS, partly owing to additional costs, labor, or amount of DNA used. Replicateshave been implemented in other forms in earlier research for direct comparison of allelicdropout in single and sparse cell assays (Simonsen et al., 2018)*, together with former col-leagues. In another study, a more robust control sample was established by Line Nederbyfrom both cultured keratinocytes and fibroblasts (Hansen et al., 2018)* as a biologicalreplicate. In a third, a pseudo-germline variant set was established in relation to a studyof monozygotic twins with chronic monoclonal B cell lymphoproliferation (Hansen et al.,2015b)*. The latter study was based on the presumption that the twins carried identicalgenomes at conception, and thus the germline control was deemed confident in positionswhere variant alleles were identical.

From the large 1000 Genomes project, it has been suggested, based on empirical evi-dence, that implementing at least two different technologies reduces the false-positiverate (Nothnagel et al., 2011). Analogous to DNA replicates, I have previously demon-strated that the combination of the transcriptome and coding genome mitigated falsepositives and improved the detection of somatic driver mutations (Hansen et al., 2016)*.Furthermore, the direct intersection of mutation candidates from control-paired DNA andRNA samples revealed that it is possible to narrow down the number of somatic singlenucleotide variants and small indels one to two orders of magnitude to very few expresseddrivers with high concordance to clinical laboratory analyses. Concerning Paper II, andin concordance with previous research, the number of somatic mutations was markedlydecreased when using combined DNA and RNA sequencing, in contrast to the numbercalled by the former state-of-the-art algorithm MuTect (Cibulskis et al., 2013)(BroadInstitute, Cambridge, MA, USA), which was ten-fold higher than the RNA-intersectedsomatic mutations. Also, a discouraging small overlap was observed between the twodifferent calling algorithms, MuTect and CLC Genomics Workbench (QIAGEN Aarhus,Denmark).

On the other hand, somatic mutations are not necessarily expressed in the investigated

84

cell. The lack of expression may be tissue-specific, be the result of a biological cue, or besomatically induced – exemplified by changes introducing a stop codon or frameshift asdescribed in a case of AML with normal karyotype harboring a somatic WT1 mutation(Hansen et al., 2016)*. In such cases, the absence of gene expression may be a resultof nonsense-mediated decay. Consequently, it is also recommended to assess mutationscalled by DNA sequencing independently. Cells with intensive proliferation and DNAreplication, and perhaps increased genomic instability, will inevitably acquire additionalpassenger mutations with no known function in the context of the tumor. Not all of thesemutations are contributing to the cancer phenotype. Both of the referenced studies showthat the resulting few somatic candidates are in stark agreement with census oncogenesor tumor suppressors for the malignancy, i.e., TP53, ATM and NOTCH1 in mantle celllymphoma or IDH1, IDH2 and NPM1 in the referenced AML cases. Despite this, onestudy has shown only minimal overlap between single-nucleotide variants detected fromboth DNA and RNA (14%, O’Brien et al. (2015)), and collectively the published studiesincluding both RNA and DNA sequencing have been sparse. One such study used bothto investigate variants in patient-derived tumor xenografts (Rokita et al., 2019), whereasanother recent study also integrated exome and RNA sequencing but reserved the lat-ter for the detection of fusion-genes and expression analysis (Hirata et al., 2019). Thereluctance for using such a strategy is interesting in the light of the cardinal paper byThe Cancer Genome Atlas Research Network, being a forerunner of the genomic age inhematology, which investigated the genomic landscapes of 200 de novo AML cases in un-precedented details using WGS, WES, methylation arrays, RNA and miRNA-sequencing(Cancer Genome Atlas Research et al., 2013). In the presented study (Paper II), tran-scriptome sequencing leveraged the genomic profile in several ways: First of all, it reducedfalse-positive observations, and secondly, it held expressional information, which was notevident from DNA sequencing alone. Also, a consensus between RNA expression levelsand qPCR was found. Another definite advantage of implementing RNA is to detect alter-native splicing or exon skipping, which is not detectable from genomic DNA alone. And,finally, a more subtle usage of replicate or combined DNA/RNA sequencing has been thecrosschecking of the unique variant fingerprint of an individual (Cummings et al., 2017).Either way replicates reduce the risk of accidental sample swapping to a minimum.

85

12.4 Integrative molecular diagnostics

Perhaps the most compelling result or outcome of these collective studies here is the factthat NGS may be used for integrative molecular diagnostics or prognostics. Several of theidentified molecular lesions within these papers have a direct or suggestive role in diag-nostics and prognostics. Of course, the most evident are the ATM and TP53 mutationsin Paper II and IV, but it is, in my opinion, apparent from the mutational profiles thata proper workflow enables efficient screening of patient-specific cancer drivers. At thetime of sequencing, little evidence supported a role for several of the mutations, such asSPEN, but new evidence is continuously being gathered. It is now known that CCND1mutations are recurrent, particularly being associated with non-nodal MCL (Puente et al.,2018).

Interestingly, despite being characterized as markedly heterogeneous, MCL seems to havecoherent principles such as the co-occurrence of CCND1 mutations and IGHV-mutationsdescribed by Bea et al. (2013), as is the case for Patient 3 in Paper II. The doublemutated CCND1 mutations were located in the same N-terminal region (amino acid41–47) as the cases reported by Bea et al. (2013). In the same line, it is not surprisingto find a wide range of CNA in ATM or TP53 mutated cases. Multiple studies haveshown that complex karyotypes and extensive CNA are risk factors and are associatedwith poor prognoses (Hieronymus et al., 2018; Parkin et al., 2010; Greenwell et al., 2018).In contrast, Patient 4 was distinct by the relatively low number of aberrations and lowCCND1 expression resolved by RNA-sequencing. These results pointed to a genotypesimilar to splenic marginal zone lymphoma, in which SPEN and MYD88 mutations andchromosome 7q32 deletions are recurrent (Jaramillo Oquendo et al., 2019). Cases ofMCL resembling marginal-zone lymphoma have previously been described (Randen et al.,2011), but it may also, in frank genomic terms, be more related to a splenic marginalzone lymphoma – an indolent B-cell non-Hodgkin lymphoma. In this setting, it is thusinteresting that Patient 4 was the only patient in this study who did not relapse andexperienced a slow progression.

This integrative potential was also exemplified in two cases: The first one being in aprevious case of diagnosed T-ALL in a young adult concluded to be a immature earlyT-cell precursor leukemia originating from early thymic progenitors with myeloid andlymphoid potential (Hansen et al., 2018)*. The other example was a case of CN-AML,where CN-LoH was found during the development of the methods described in Paper III.

86

While the first referenced study provided clear evidence of subclonal involvement whenpairing mutational frequencies and the clearance of mutations between diagnosis, firstrelapse, and second relapse, I have found no convincing evidence of co-existing subcloneswithin the MCL cases. However, the results of Paper II do indicate from clearance ofsignificant chromosome 12 and 13 losses that a pre-diagnostic clone exists for Patient 1,which gives rise to a genotypical difference, and thus likely phenotypical different relapseclone. In contrast to single-cell sequencing, the use of WES to infer subclonal involvementis tricky as bulk sequencing offers no spatial resolution. In addition, it is challenged bythe potential frequency overlap of clones and the complexity of allelic imbalances.

In my view, the premalignant constituents or the potential role of a clonal bone mar-row disorder preceding MCL is still a currently under-investigated theme. Our researchgroup’s latest results show that more immature SOX11+ pro-/pre-B cells exist amongpresumably normal SOX4+. This phenomenon was demonstrated from a population ofmore than 30,000 cells from eight patients with MCL (Hansen et al., 2021)*. Theseresults emphasized the heterogeneity between the individuals with little evidence of sub-clonal involvement. The results are interesting from the knowledge that t(11;14) is notconfined to cancer (Espinet et al., 2014), but may be present in otherwise healthy indi-viduals. Although a different disorder, the theme of an early precursor capacity was alsoexplored with the mentioned immature T-ALL. This case supported a model in whichIKZF1 p.G158S together with CDK2NA nonsense mutation p.R80* were presumably in-herent to the founding clone, and FLT3 p.D835Y, often observed as a latecomer in themalignant progression, being a subclonal addition (Hansen et al., 2018)*. The eradicationof the FLT3 -TKD point mutation following treatment was replaced by a FLT3 internaltandem duplication, apparently complete loss of functional CDKN2A by a retained non-sense mutation and a CN-LoH on chromosome 9p, which harbors this cyclin-dependentkinase inhibitor. The molecular portrait made it possible to suggest that a more subtlediagnosis of early T-cell precursor acute lymphoblastic leukemia (ETP-ALL), which isnow identified as a high-risk subtype with an abysmal prognosis.

The number of mutations in different cancer forms may vary as much as a 1,000-fold(Lawrence et al., 2013; Alexandrov et al., 2013). A large discrepancy in the number ofobserved mutations relative to the expected of a given hematologic malignancy may in-dicate a problem within the workflow. As an example, Cancer Genome Atlas Researchet al. (2013) and others have shown that AML generally displays few somatic mutations,contrasting, for example, lung cancer, colon cancer, or melanoma. In the same manner,the number of individual driver mutations in MCL may be quite low. Thus, in spite of

87

being characterized as very heterogeneous, it appears that at least the list of putativesomatic driver mutations is limited but may vary if focusing on solid tumors. The ge-nomic positions often vary even in recurrent gene mutations, and, as such, the extensivecombination of mutations and structural variants calls for rigorous annotation.

From a clinical perspective, it is challenging to bridge the variant data derived fromNGS tests to actionable insights unless clinical action is solely based on consensus drivermutations. By experience, sifting through potential somatic candidates is often a dauntingtask and somewhat influenced by subjectivity. Together with a substantial amount offalse-positives by the somatic calling algorithm, it is an ongoing concern that scantyevidence leads to faulty conclusions. To alleviate the problem and partially circumventthe subjectivity of variant assessment, Paper II utilizes a strategy where transcriptssupport the coding genome.

Annotation, filtering, and curation are some of the most crucial, and perhaps most chal-lenging steps, in the collective workflow of NGS bioinformatics. The knowledge base andsupporting evidence to confidently retain or reject a somatic candidate as being causal isever-changing. Not only do formats change over time, but also the compatibility with agiven genome assembly or bioinformatics tool. External annotation resources are oftenconsiderable in size, such as the dbSNP database or the Catalogue Of Somatic MutationsIn Cancer (Sanger Institute, Cambridgeshire, UK). Because annotation is time-consumingand error-prone, this branch of computational analysis can be regarded as a specializedsubfield of NGS bioinformatics.

A crucial part of the included studies related to the proper annotation of the genomic data.It is one of the most elusive and error-prone steps in of sequencing research projects, as thepublicly available data changes constantly and is stored in a variety of formats. Masteringdata annotation, curation, and visualization is a tedious and often confounding process,requiring the researcher to develop adept analytical skills. Concerning my projects, theannotation of NGS data has perhaps been the most time-consuming. No consensus existson how to curate sequencing data from cancer genomes. Following the first sequencedgenome of a cancer patient (Ley et al., 2008), the problem was approached by usingtiers for the classification of somatic mutations found in another patient with AML; thefirst category, tier 1, comprised mutations in amino acid coding regions of exons, splice-site regions, and RNA genes (Mardis et al., 2009). Albeit, this is not sufficient for theinterpretation of coding variants. However difficult it may be, sequencing annotationforms an essential pillar of modern cancer research.

88

The annotation techniques used for the presented research have primarily been based onthe previously described work (Hansen et al., 2015a)* and from developed cross-platformannotation software, VariFant, developed in the Xojo programming environment andlanguage (Xojo, Houston, TX, USA). These techniques have been extensively refined forthe use in several projects (Hansen et al., 2015b, 2018; Simonsen et al., 2018; Hansen et al.,2016; Simonsen et al., 2020)* (in addition to Paper II and IV), and have been a time-consuming effort in pairing data with relevant and meaningful clinical information, onlinebiomedical databases, such as the Catalogue Of Somatic Mutations In Cancer (Bamfordet al., 2004) or dbSNP, and automatic literature search through PubMed. Regrettably,this project has currently been abandoned, unpublished.

12.5 Copy-number calling in WES

With the capability to map the complex combinations of recurrent lesions of a cancergenome, the attention was foremost directed towards somatic mutations, initially with astrong emphasis on single-nucleotide variants. However, it has become clear that mosttypes of short-read sequencing, be it a panel, exome, or WGS, provide the researchermeans of inferring large structural somatic variants on the chromosomal level. Relying onWES to call CNA has been heavily debated in the literature. From the research conductedover the last couple of years, I dismiss the notion that WES is unsuited for resolving CNA,although some limitations exist, and it is rapidly being replaced by the more unbiasedsequencing of the complete genome. Instead, failure to resolve the CNA is foremost anissue of poor quality, low sequencing depth, tumor burden, chosen laboratory workflow,or inappropriate data processing. It is found that WES is very sensitive towards batcheffects; however, plotting the linear correlation of regional reads between tumor-controlpaired samples is an effective way of checking read-depth inconsistencies. I find that thelack of transparency is a general issue in published studies involving NGS, making itdifficult to review or reproduce such studies. Despite many available CNA tools, it is nota trivial task to pinpoint an algorithm that performs well in all situations.

The ability to discern chromosomal copies is readily demonstrated by the lenient copynumber profile of Patient 1 (Paper II), with more than half of the chromosomes be-ing afflicted. One of the strongest associations with decreased survival is the deletionof chromosome 17p and TP53 mutations, as is found in Patients 2 and 3. This alsopertains to CLL (Campo et al., 2018) and other hematologic malignancies. It is almostuniversally applicable in cancer diagnostics, yet to my knowledge, even TP53 screening

89

is only now becoming routine for many hospital laboratories. Although the work withPaper IV exemplified some of the problems with CNA detection in WES, mainly due toincreased noise, it showed that detection of allelic imbalance through change in variantallele frequencies is much less affected by sequencing batch effects in comparison to the re-gional shift in read-depths. Patient 4, who experienced a long-term response to ibrutinibfollowing intraocular infiltration of MCL cells, is the first case I have encountered withconcurrent lesions of ATM and TP53, detected within the dorsal tumor at relapse.

Copy-number variation/alterations are generally detected in NGS reads by two major, yetdifferent, approaches: read-depth ratios and VAF analysis. Because of the inherent noiseof WES, the size of the CNA has to be in the order of megabases to be confidently identifiedfrom altered read depth ratios between tumor and control samples, while allelic imbalancedue to duplication or deletion may be resolved from a single nucleotide variant. Perhapscoincidently, early large-scale studies have showed that the majority of copy gains andlosses most frequently were either short focal (recurrent) or large arm- and chromosome-level aberrations (Beroukhim et al., 2010). However, this may also point to some of thelimitations in relation to resolution by both array and sequencing technologies.

As the coding genome is found to contain roughly twenty-thousand variants (Stitziel et al.,2011; Ng et al., 2009; Hoischen et al., 2010; Bolze et al., 2010; Musunuru et al., 2010),on average, a single coding variant is expected per gene. This knowledge may form apractical quality measure as argued in Paper I, but also forms the basis of the proposedmethod in Paper III, where the juxtaposition of reads per gene and single nucleotidevariants supports the detection of copy-afflicted chromosomes or CN-LoH.

12.6 Final remarks

The limitations of my study pertain to the low number of patient samples and, in certainaspects, suboptimal quality. To a bioinformatician, data is power; both the amount ofthe data and its quality. As is the case for Paper II, any differential gene expressionanalyses will suffer from a low number of diagnosis-relapse paired samples. Furthermore,it is difficult to draw specific biological conclusions for most of the CNAs. Fortunately,the number of individuals diagnosed with MCL is relatively low, but this also meansthat it is cumbersome to include a large cohort of patients. This is especially true forCNS-infiltrating MCL, which affects very few patients a year in Denmark.

90

A wide range of bioinformatics tools exist. The tools are often time-consuming to im-plement correctly, and thus I have only evaluated a fraction of these, many of them notbeing described here. However, my primary focus has dealt with the more basic interfacebetween NGS data and its processing. This, by itself, has been a comprehensive task inthe last couple of years and is not confined to the details presented.

As the difference in cost between WES and WGS is leveling out, the latter will inevitablybe the first choice for many clinical applications. Still, the current state-of-the-art se-quencing platforms from Illumina, such as the NovaSeq 6000 system, generally delivershort reads and pose a challenge in the transition from cytogenetics to cytogenomics. Forsome applications, the Oxford Nanopore systems (Oxford Nanopore Technologies, Ox-ford, UK) can provide reads in kilobases or even megabases and may be the most suitableplatform for detecting extensive structural chromosomal alterations in cancer genomes,such as balanced translocations.

It is noted that health care information pages often describe the causality behind non-Hodgkin lymphoma as being unknown, but having entered the post-genomic era, it isnow clear that discrete lesions of the genome are the primary culprits driving the ma-lignancies. Yet, the defined term post-genomic, i.e., referring to the completion of thehuman genome, does not provide any reasonable justice to the massive recent and on-going research undertaking in the field of hematology. The development of hematologicmalignancy is partly described by stochastic somatic events in the setting of an inheritedgenotype, the immune system, and the aging genome, partly in successive and orderedevents. The direct implication is that distinguishing between a premalignant disorder anda cancerous state is not trivial and needs to be further explored in the future.

91

Part III

Appendix

92

13 Methods and materials

The included papers and samples were largely based on high-throughput second-generationsequencing on the Illumina systems (HiSeq 2000/2500, NextSeq550) with capture-basedtargeted sequencing. Exome enrichment and library preparation based on SureSelect (Ag-ilent Technologies, CA, USA), Nextera (Illumina, CA, USA), NimbleGen (F. Hoffmann-La Roche AG, Basel, Switzerland) kits. RNA libraries have been prepared with TruSeqstranded mRNA kit and TruSeq RNA Access Library Prep Kit (Illumina). The latter wasintroduced as a protocol for a low RNA input (10 ng) with high quality from purified cellsby sorting. DNA samples were generally sequenced with a paired-end read length of 100bases or more. Alignment (GRCh37) and variant calling was performed using BWA (Liand Durbin, 2009) and GATK (McKenna et al., 2010) (3.6/3.8/4) workflow (McKennaet al., 2010) or with CLC Biomedical Genomics Workbench (Qiagen). Variants and readslocated outside targeted regions were excluded from the analyses, and a hard filteredbase-quality threshold of 25 and minimum allele read depth of 20 were applied. Gene-wise read depths were retrieved with BEDTools (Quinlan and Hall, 2010) from RefSeqcoordinates (UCSC, Karolchik et al. (2004)). Statistics and data analysis of variants andgene read depth were performed in Mathematica (Wolfram Research, Ill, USA) and R(3.6). Gene-wise read-depth ratios were compared to the established CNA tool VarScan2Koboldt et al. (2012), which evaluates read-depth ratios without statistical inference.Figures were prepared in Prism 7/8 (GraphPad Software, CA, USA), Mathematica, andAdobe Illustrator. The general-purpose plotting software for juxtaposing variant allelefrequencies and read-depth ratios, implementing Fisher’s exact, �2 test and U test, wasdeveloped in Xojo (Xojo, Houston, TX, USA). Annotation was primarily performed withthe developed software Varifant (unpublished), which combined multiple online databases,such as COSMIC, PubMed, ClinVar, and dbSNP. I refer to the individual papers for morespecific information on the materials and methods.

93

Scripts for detection of copy-number

alterations in whole genomes

QDNAseq for unpaired CNA detection

The script implements the CNA detection algorithm QDNAseq described by Scheininet al. (2014). The algorithm is found highly practical for WGS (Fig. 13.1) and sup-ports parallel processing. However, it is not suited for retrieving statistically significantchange.

1

2 # Rscript qdnaseq.R [file.bam] 100

3

4 args = commandArgs(trailingOnly=TRUE)

5 case <- sub (".bam","",tail(strsplit(args[1], "/") [[1]],n=1))

6 opath <- "~/ Arcedi/output/RDratio_v2/qdnaseq /"

7

8 # LOAD LIBRARY

9 library(QDNAseq)

10

11 write(" Running qdnaseq ... may take some time", stdout ())

12

13 # USE PROVIDED BINSIZES IN KILOBASES AND LOAD BAM -FILE

14 bins <- getBinAnnotations(binSize=as.integer(args [2]))

15 readCounts <- binReadCounts(bins , args[1], pairedEnds=FALSE)

16 readCountsFiltered <- applyFilters(readCounts , residual=TRUE , blacklist=

TRUE)

17 readCountsFiltered <- estimateCorrection(readCountsFiltered)

18

19 readCountsFiltered <- applyFilters(readCountsFiltered , chromosomes=NA)

94

20

21 copyNumbers <- correctBins(readCountsFiltered)

22 copyNumbersNormalized <- normalizeBins(copyNumbers)

23 copyNumbersSmooth <- smoothOutlierBins(copyNumbersNormalized)

24

25 # PLOT THE SMOOTHENED COVERAGE PLOT

26 jpeg(paste(opath ,case ,". smooth.jpg",sep=’’))

27 plot(copyNumbersSmooth)

28 dev.off()

29

30 # OUTPUT RESULTS FOR FURTHER ANALYSES IF NEEDED

31 exportBins(copyNumbersSmooth , paste(opath ,case ,". smooth.txt",sep =""))

32 exportBins(copyNumbersSmooth , paste(opath ,case ,". smooth.igv",sep =""),

format ="igv")

33 exportBins(copyNumbersSmooth , paste(opath ,case ,". smooth.bed",sep =""),

format ="bed")

34 copyNumbersSegmented <- segmentBins(copyNumbersSmooth , transformFun ="sqrt")

35 copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)

36

37 jpeg(paste(opath ,case ,". segmented.jpg",sep =""))

38 plot(copyNumbersSegmented)

39 dev.off()

40

41 # PLOT THE SMOOTHENED COPY NUMBERS

42 copyNumbersCalled <- callBins(copyNumbersSegmented)

43 jpeg(paste(opath ,case ,". CNVs.jpg",sep =""))

44 plot(copyNumbersCalled)

45 dev.off()

Developed R script (RDratio) for parallelized detection

of CNAs

The script implements some of the concepts presented in this thesis for fast statisticaland visual detection of CNAs. It is suited for both high and low coverage WGS and canbe run in parallel in a high-performance computing cluster.

1

2 # CREATE STATISTICS AND PLOTS FROM READ DEPTH RATIOS

3 # Marcus Hansen , 20/2-20, [email protected]

4 # Note that R must be installed: conda install -c conda -forge r-base

5 # Note that execution must be performed in batch or queue , e.g.: srun --mem

=16g --pty /bin/bash

6

95

Figure 13.1: Copy-number profile from whole-genome sequencing of saliva sam-ple from healthy male donor using QDNAseq.

7

8 args = commandArgs(trailingOnly=TRUE)

9

10 case <- sub (".bed","",tail(strsplit(args[1], "/") [[1]],n=1))

11 control <- sub(".bed","",tail(strsplit(args[2], "/") [[1]],n=1))

12 resultpath ="~/ RDratio /"

13

14 # Define significance threshold and number of observations in each test

15

16 if (length(args)==3 || length(args)==4) {

17 n <- as.numeric(args [[3]])

18 } else {

19 n <- 100

20 }

21

22 if (length(args)==4) {

23 significancelevel <- as.numeric(args [[4]])

24 } else {

25 significancelevel <- 0.01

26 }

27

28

96

29 # Import case and control

30 a <- read.delim(paste(resultpath ,"BED/",args[1],sep =""),header=FALSE)

31 b <- read.delim(paste(resultpath ,"BED/",args[2],sep =""),header=FALSE)

32 c <- merge(a,b,by=4,sort = FALSE)

33

34

35 # Normalize and remove extreme outliers

36

37 c[,5] <-round(c[,5]/sum(c[, 5]) *10^( floor(log10(sum(c[, 5])))) ,0)

38 c[,9] <-round(c[,9]/sum(c[, 9]) *10^( floor(log10(sum(c[, 9])))) ,0)

39 c <- subset(c, c[,5] < (quantile(c[,5]) [[4]]+ IQR(c[,5])*3) & c[,9] < (

quantile(c[,9]) [[4]]+ IQR(c[,9])*3))

40 c <- subset(c, c[,5] > 100 | c[,9] > 100)

41 c[,5] <- runmed(c[,5],11)

42 c[,9] <- runmed(c[,9],11)

43

44 # Calculate ratio and log2 of noisefiltered ratio

45 c[,10] <-c[,5]/c[,9]

46 c[,10][is.na(c[,10])] <- 1

47 c[,11] <- round(log2(runmed(c[,10] ,99)),digits =3)

48

49 # Define variables before U test

50 d <- c[,5]

51 e <- c[,9]

52 bonferroni <- significancelevel/length(d)

53 info <- c[,1]

54 parr <-c()

55 txtoutput <-c()

56 chrf <- ""

57 chrposf <-c()

58 chrlist <-c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",

"13" ,"14" , "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y")

59

60 for (u in 1: length(d)) {

61 if (c[u,2] != chrf) {

62 chrf = c[u,2]

63 chrposf <- append(chrposf , u)

64 }

65 }

66

67 chr <- ""

68 chrpos <-c()

69

70 # Perform U test and generate text output

71 for (i in 1:round(length(d)/n)) {

97

72 if (c[i*n,2] != chr) {

73 chr = c[i*n,2]

74 chrpos <- append(chrpos , i)

75 }

76 pval <- wilcox.test(d[(i*n-n+1):(i*n)],e[(i*n-n+1):(i*n)])$p.value

77 ifelse(pval < bonferroni , sig <-" significant", sig <-"-")

78

79 parr <- append(parr , pval)

80 txtoutput <- append(txtoutput ,paste(info[i*n],c[i*n,1],c[i*n,0],c[i*n,2],c

[i*n,3],n,round(sum(d[(i*n-n+1):(i*n)])/sum(e[(i*n-n+1):(i*n)]),digits

=3),pval ,sig ,sep = "\t", collapse = NULL))

81 }

82

83 # Get system date and transform it:

84 date <- Sys.Date()

85 newdate <- strptime(as.character(date), "%Y-%m-%d")

86 newdate <- format(newdate , "%d-%m-%Y")

87

88 # Create dir and paths for statistics and plots:

89 dir.create(paste(resultpath ," statistics/Bins=",n,"_Sign=",significancelevel

,"_",newdate ,sep =""))

90 stat_path= paste(" statistics/Bins=",n," _Sign=",significancelevel ,"_",

newdate ,"/",sep ="")

91 dir.create(paste(resultpath ,"plots/Bins=",n,"_Sign=",significancelevel ,"_",

newdate ,sep =""))

92 plot_path= paste(" plots/Bins=",n,"_Sign=",significancelevel ,"_",newdate

,"/",sep ="")

93

94

95

96 # Save text output (no further noise reduction is performed to get higher

sensitivity)

97 fileConn <-file(paste(resultpath ,stat_path ,case ,"_VS_",control ,"_",n,"_",

significancelevel ," _statistics.txt",sep =""))

98 writeLines(txtoutput , fileConn)

99 close(fileConn)

100

101 # Generate plots (some noise reduction is performed on significance plot

for visual specificity)

102 parr[is.na(parr)] <- 1

103

104 jpeg(paste(resultpath ,plot_path ,case ,"_VS_",control ,"_",n,"_",

significancelevel , "_rplot.jpg",sep =""))

105 par(mfrow=c(2,1))

98

106 plot(c[,11],ylim=c( -1.5 ,1.5), main=paste("case"," VS\n",control ,sep =""),

ylab = "Ratio", xlab = "Bin number", xaxt=’n’,pch = 9, cex = .1)

107 title=paste("U-test significance plot (bin size: ",toString(n) ," of 10Kb

each)\nAlpha level = ",toString(significancelevel) ,"\ nBonferroni

adjusted = ",toString(bonferroni) ,")",sep ="")

108 abline(v=chrposf , col=" gray92 ")

109 axis(1, at=chrposf , labels=chrlist)

110 print(title)

111 plot(runmed(log2(parr) ,11)*-1,type = "o", ylab = "p-value (red line:

adjusted alphalevel)", xlab = "Number of tests", xaxt=’n’,main=title ,

pch = 9, cex = .1)

112 axis(1, at=chrpos , labels=chrlist)

113 abline(h=log2(bonferroni)*-1, col="red")

114 abline(v=chrpos , col=" gray92 ")

115

116 dev.off()

99

Bibliography

Abrahamsson, A., Albertsson-Lindblad, A., Brown, P. N., Baumgartner-Wennerholm, S., Ped-ersen, L. M., D’Amore, F., Nilsson-Ehle, H., Jensen, P., Pedersen, M., Geisler, C. H., andJerkeman, M. (2014). Real world data on primary treatment for mantle cell lymphoma: anordic lymphoma group observational study. Blood, 124(8):1288–95.

Ahmed, M., Zhang, L., Nomie, K., Lam, L., and Wang, M. (2016). Gene mutations and actionablegenetic lesions in mantle cell lymphoma. Oncotarget, 7(36):58638–58648.

Alexandrov, L. B., Nik-Zainal, S., Wedge, D. C., Aparicio, S. A., Behjati, S., Biankin, A. V.,Bignell, G. R., Bolli, N., Borg, A., Borresen-Dale, A. L., Boyault, S., Burkhardt, B., Butler,A. P., Caldas, C., Davies, H. R., Desmedt, C., Eils, R., Eyfjord, J. E., Foekens, J. A., Greaves,M., Hosoda, F., Hutter, B., Ilicic, T., Imbeaud, S., Imielinski, M., Jager, N., Jones, D. T.,Jones, D., Knappskog, S., Kool, M., Lakhani, S. R., Lopez-Otin, C., Martin, S., Munshi,N. C., Nakamura, H., Northcott, P. A., Pajic, M., Papaemmanuil, E., Paradiso, A., Pearson,J. V., Puente, X. S., Raine, K., Ramakrishna, M., Richardson, A. L., Richter, J., Rosenstiel,P., Schlesner, M., Schumacher, T. N., Span, P. N., Teague, J. W., Totoki, Y., Tutt, A. N.,Valdes-Mas, R., van Buuren, M. M., van ’t Veer, L., Vincent-Salomon, A., Waddell, N., Yates,L. R., Australian Pancreatic Cancer Genome, I., Consortium, I. B. C., Consortium, I. M.-S., PedBrain, I., Zucman-Rossi, J., Futreal, P. A., McDermott, U., Lichter, P., Meyerson,M., Grimmond, S. M., Siebert, R., Campo, E., Shibata, T., Pfister, S. M., Campbell, P. J.,and Stratton, M. R. (2013). Signatures of mutational processes in human cancer. Nature,500(7463):415–21.

Arber, D. A., Orazi, A., Hasserjian, R., Thiele, J., Borowitz, M. J., Le Beau, M. M., Bloomfield,C. D., Cazzola, M., and Vardiman, J. W. (2016). The 2016 revision to the world healthorganization classification of myeloid neoplasms and acute leukemia. Blood, 127(20):2391–405.

Arenillas, L., Mallo, M., Ramos, F., Guinta, K., Barragan, E., Lumbreras, E., Larrayoz, M. J.,De Paz, R., Tormo, M., Abaigar, M., Pedro, C., Cervera, J., Such, E., Jose Calasanz, M., Diez-Campelo, M., Sanz, G. F., Hernandez, J. M., Luno, E., Saumell, S., Maciejewski, J., Florensa,L., and Sole, F. (2013). Single nucleotide polymorphism array karyotyping: a diagnosticand prognostic tool in myelodysplastic syndromes with unsuccessful conventional cytogenetictesting. Genes Chromosomes Cancer, 52(12):1167–77.

Bamford, S., Dawson, E., Forbes, S., Clements, J., Pettett, R., Dogan, A., Flanagan, A., Teague,J., Futreal, P. A., Stratton, M. R., and Wooster, R. (2004). The cosmic (catalogue of somaticmutations in cancer) database and website. Br J Cancer, 91(2):355–8.

Bea, S. and Amador, V. (2017). Role of sox11 and genetic events cooperating with cyclin d1 inmantle cell lymphoma. Curr Oncol Rep, 19(6):43.

Bea, S., Valdes-Mas, R., Navarro, A., Salaverria, I., Martin-Garcia, D., Jares, P., Gine, E.,Pinyol, M., Royo, C., Nadeu, F., Conde, L., Juan, M., Clot, G., Vizan, P., Di Croce, L.,Puente, D. A., Lopez-Guerra, M., Moros, A., Roue, G., Aymerich, M., Villamor, N., Colomo,L., Martinez, A., Valera, A., Martin-Subero, J. I., Amador, V., Hernandez, L., Rozman, M.,Enjuanes, A., Forcada, P., Muntanola, A., Hartmann, E. M., Calasanz, M. J., Rosenwald,

100

A., Ott, G., Hernandez-Rivas, J. M., Klapper, W., Siebert, R., Wiestner, A., Wilson, W. H.,Colomer, D., Lopez-Guillermo, A., Lopez-Otin, C., Puente, X. S., and Campo, E. (2013).Landscape of somatic mutations and clonal evolution in mantle cell lymphoma. Proc NatlAcad Sci U S A, 110(45):18250–5.

Beekman, R., Amador, V., and Campo, E. (2018). Sox11, a key oncogenic factor in mantle celllymphoma. Curr Opin Hematol, 25(4):299–306.

Belkadi, A., Bolze, A., Itan, Y., Cobat, A., Vincent, Q. B., Antipenko, A., Shang, L., Boisson, B.,Casanova, J. L., and Abel, L. (2015). Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc Natl Acad Sci U S A, 112(17):5473–8.

Bernitz, J. M., Kim, H. S., MacArthur, B., Sieburg, H., and Moore, K. (2016). Hematopoieticstem cells count and remember self-renewal divisions. Cell, 167(5):1296–1309 e10.

Beroukhim, R., Mermel, C. H., Porter, D., Wei, G., Raychaudhuri, S., Donovan, J., Barretina,J., Boehm, J. S., Dobson, J., Urashima, M., Mc Henry, K. T., Pinchback, R. M., Ligon, A. H.,Cho, Y. J., Haery, L., Greulich, H., Reich, M., Winckler, W., Lawrence, M. S., Weir, B. A.,Tanaka, K. E., Chiang, D. Y., Bass, A. J., Loo, A., Hoffman, C., Prensner, J., Liefeld, T.,Gao, Q., Yecies, D., Signoretti, S., Maher, E., Kaye, F. J., Sasaki, H., Tepper, J. E., Fletcher,J. A., Tabernero, J., Baselga, J., Tsao, M. S., Demichelis, F., Rubin, M. A., Janne, P. A., Daly,M. J., Nucera, C., Levine, R. L., Ebert, B. L., Gabriel, S., Rustgi, A. K., Antonescu, C. R.,Ladanyi, M., Letai, A., Garraway, L. A., Loda, M., Beer, D. G., True, L. D., Okamoto, A.,Pomeroy, S. L., Singer, S., Golub, T. R., Lander, E. S., Getz, G., Sellers, W. R., and Meyerson,M. (2010). The landscape of somatic copy-number alteration across human cancers. Nature,463(7283):899–905.

Berry, N. K., Scott, R. J., Rowlings, P., and Enjeti, A. K. (2019). Clinical use of snp-microarraysfor the detection of genome-wide changes in haematological malignancies. Crit Rev OncolHematol, 142:58–67.

Bolze, A., Byun, M., McDonald, D., Morgan, N. V., Abhyankar, A., Premkumar, L., Puel,A., Bacon, C. M., Rieux-Laucat, F., Pang, K., Britland, A., Abel, L., Cant, A., Maher,E. R., Riedl, S. J., Hambleton, S., and Casanova, J. L. (2010). Whole-exome-sequencing-based discovery of human fadd deficiency. Am J Hum Genet, 87(6):873–81.

Braggio, E., Egan, J. B., Fonseca, R., and Stewart, A. K. (2013). Lessons from next-generationsequencing analysis in hematological malignancies. Blood Cancer J, 3:e127.

Bushell, K. R., Kim, Y., Chan, F. C., Ben-Neriah, S., Jenks, A., Alcaide, M., Fornika, D.,Grande, B. M., Arthur, S., Gascoyne, R. D., Steidl, C., and Morin, R. D. (2015). Geneticinactivation of traf3 in canine and human b-cell lymphoma. Blood, 125(6):999–1005.

Campo, E., Cymbalista, F., Ghia, P., Jager, U., Pospisilova, S., Rosenquist, R., Schuh, A., andStilgenbauer, S. (2018). Tp53 aberrations in chronic lymphocytic leukemia: an overview ofthe clinical implications of improved diagnostics. Haematologica, 103(12):1956–1968.

Cancer Genome Atlas Research, N., Ley, T. J., Miller, C., Ding, L., Raphael, B. J., Mungall,A. J., Robertson, A., Hoadley, K., Triche, T. J., J., Laird, P. W., Baty, J. D., Fulton, L. L.,Fulton, R., Heath, S. E., Kalicki-Veizer, J., Kandoth, C., Klco, J. M., Koboldt, D. C., Kanchi,K. L., Kulkarni, S., Lamprecht, T. L., Larson, D. E., Lin, L., Lu, C., McLellan, M. D.,McMichael, J. F., Payton, J., Schmidt, H., Spencer, D. H., Tomasson, M. H., Wallis, J. W.,Wartman, L. D., Watson, M. A., Welch, J., Wendl, M. C., Ally, A., Balasundaram, M., Birol,I., Butterfield, Y., Chiu, R., Chu, A., Chuah, E., Chun, H. J., Corbett, R., Dhalla, N., Guin,R., He, A., Hirst, C., Hirst, M., Holt, R. A., Jones, S., Karsan, A., Lee, D., Li, H. I., Marra,M. A., Mayo, M., Moore, R. A., Mungall, K., Parker, J., Pleasance, E., Plettner, P., Schein, J.,Stoll, D., Swanson, L., Tam, A., Thiessen, N., Varhol, R., Wye, N., Zhao, Y., Gabriel, S., Getz,G., Sougnez, C., Zou, L., Leiserson, M. D., Vandin, F., Wu, H. T., Applebaum, F., Baylin,S. B., Akbani, R., Broom, B. M., Chen, K., Motter, T. C., Nguyen, K., Weinstein, J. N.,Zhang, N., Ferguson, M. L., Adams, C., Black, A., Bowen, J., Gastier-Foster, J., Grossman,T., Lichtenberg, T., Wise, L., Davidsen, T., Demchok, J. A., Shaw, K. R., Sheth, M., Sofia,

101

H. J., Yang, L., Downing, J. R., et al. (2013). Genomic and epigenomic landscapes of adultde novo acute myeloid leukemia. N Engl J Med, 368(22):2059–74.

Cesana, D., Calabria, A., Rudilosso, L., Gallina, P., Benedicenti, F., Spinozzi, G., Schiroli, G.,Magnani, A., Acquati, S., Fumagalli, F., Calbi, V., Witzel, M., Bushman, F. D., Cantore,A., Genovese, P., Klein, C., Fischer, A., Cavazzana, M., Six, E., Aiuti, A., Naldini, L., andMontini, E. (2021). Retrieval of vector integration sites from cell-free dna. Nat Med.

Chang, V. Y., Quintero-Rivera, F., Baldwin, E. E., Woo, K., Martinez-Agosto, J. A., Fu, C.,and Gomperts, B. N. (2011). B-acute lymphoblastic leukemia and cystinuria in a patient withduplication 22q11.21 detected by chromosomal microarray analysis. Pediatr Blood Cancer,56(3):470–3.

Chase, M. L. and Armand, P. (2018). Minimal residual disease in non-hodgkin lymphoma -current applications and future directions. Br J Haematol, 180(2):177–188.

Cheah, C. Y., George, A., Gine, E., Chiappella, A., Kluin-Nelemans, H. C., Jurczak, W.,Krawczyk, K., Mocikova, H., Klener, P., Salek, D., Walewski, J., Szymczyk, M., Smolej,L., Auer, R. L., Ritchie, D. S., Arcaini, L., Williams, M. E., Dreyling, M., Seymour, J. F., andEuropean Mantle Cell Lymphoma, N. (2013). Central nervous system involvement in mantlecell lymphoma: clinical features, prognostic factors and outcomes from the european mantlecell lymphoma network. Ann Oncol, 24(8):2119–23.

Chen, G., Mosier, S., Gocke, C. D., Lin, M. T., and Eshleman, J. R. (2014). Cytosine deaminationis a major cause of baseline noise in next-generation sequencing. Mol Diagn Ther, 18(5):587–93.

Chen, Y. H., Gao, J., Fan, G., and Peterson, L. C. (2010). Nuclear expression of sox11 is highlyassociated with mantle cell lymphoma but is independent of t(11;14)(q13;q32) in non-mantlecell b-cell neoplasms. Mod Pathol, 23(1):105–12.

Chen, Z., Yuan, Y., Chen, X., Chen, J., Lin, S., Li, X., and Du, H. (2020). Systematic compar-ison of somatic variant calling performance among different sequencing depth and mutationfrequency. Sci Rep, 10(1):3501.

Chiu, B. C. and Weisenburger, D. D. (2003). An update of the epidemiology of non-hodgkin’slymphoma. Clin Lymphoma, 4(3):161–8.

Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., Gabriel,S., Meyerson, M., Lander, E. S., and Getz, G. (2013). Sensitive detection of somatic pointmutations in impure and heterogeneous cancer samples. Nat Biotechnol, 31(3):213–9.

Cleveland, M. H., Zook, J. M., Salit, M., and Vallone, P. M. (2018). Determining performancemetrics for targeted next-generation sequencing panels using reference materials. J Mol Diagn,20(5):583–590.

Cummings, B. B., Marshall, J. L., Tukiainen, T., Lek, M., Donkervoort, S., Foley, A. R., Bolduc,V., Waddell, L. B., Sandaradura, S. A., O’Grady, G. L., Estrella, E., Reddy, H. M., Zhao, F.,Weisburd, B., Karczewski, K. J., O’Donnell-Luria, A. H., Birnbaum, D., Sarkozy, A., Hu, Y.,Gonorazky, H., Claeys, K., Joshi, H., Bournazos, A., Oates, E. C., Ghaoui, R., Davis, M. R.,Laing, N. G., Topf, A., Genotype-Tissue Expression, C., Kang, P. B., Beggs, A. H., North,K. N., Straub, V., Dowling, J. J., Muntoni, F., Clarke, N. F., Cooper, S. T., Bonnemann,C. G., and MacArthur, D. G. (2017). Improving genetic diagnosis in mendelian disease withtranscriptome sequencing. Sci Transl Med, 9(386).

Davi, F., Langerak, A. W., de Septenville, A. L., Kolijn, P. M., Hengeveld, P. J., Chatzidimitriou,A., Bonfiglio, S., Sutton, L. A., Rosenquist, R., Ghia, P., Stamatopoulos, K., Eric, t. E. R. I.o. C. L. L., and the EuroClonality, N. G. S. W. G. (2020). Immunoglobulin gene analysis inchronic lymphocytic leukemia in the era of next generation sequencing. Leukemia, 34(10):2545–2551.

de Haan, G. and Lazare, S. S. (2018). Aging of hematopoietic stem cells. Blood, 131(5):479–487.

102

Dictor, M., Ek, S., Sundberg, M., Warenholt, J., Gyorgy, C., Sernbo, S., Gustavsson, E., Abu-Alsoud, W., Wadstrom, T., and Borrebaeck, C. (2009). Strong lymphoid nuclear expressionof sox11 transcription factor defines lymphoblastic neoplasms, mantle cell lymphoma andburkitt’s lymphoma. Haematologica, 94(11):1563–8.

Dolken, G., Dolken, L., Hirt, C., Fusch, C., Rabkin, C. S., and Schuler, F. (2008). Age-dependentprevalence and frequency of circulating t(14;18)-positive cells in the peripheral blood of healthyindividuals. J Natl Cancer Inst Monogr, (39):44–7.

Dreyling, M. H., Bullinger, L., Ott, G., Stilgenbauer, S., Muller-Hermelink, H. K., Bentz, M.,Hiddemann, W., and Dohner, H. (1997). Alterations of the cyclin d1/p16-prb pathway inmantle cell lymphoma. Cancer Res, 57(20):4608–14.

Ek, S., Dictor, M., Jerkeman, M., Jirstrom, K., and Borrebaeck, C. A. (2008). Nuclear expressionof the non b-cell lineage sox11 transcription factor identifies mantle cell lymphoma. Blood,111(2):800–5.

Engholm, G., Ferlay, J., Christensen, N., Bray, F., Gjerstorff, M. L., Klint, A., Kotlum, J. E.,Olafsdottir, E., Pukkala, E., and Storm, H. H. (2010). Nordcan–a nordic tool for cancerinformation, planning, quality control and research. Acta Oncol, 49(5):725–36.

Eskelund, C. W., Dahl, C., Hansen, J. W., Westman, M., Kolstad, A., Pedersen, L. B., Montano-Almendras, C. P., Husby, S., Freiburghaus, C., Ek, S., Pedersen, A., Niemann, C., Raty, R.,Brown, P., Geisler, C. H., Andersen, M. K., Guldberg, P., Jerkeman, M., and Gronbaek, K.(2017). Tp53 mutations identify younger mantle cell lymphoma patients who do not benefitfrom intensive chemoimmunotherapy. Blood, 130(17):1903–1910.

Espinet, B., Ferrer, A., Bellosillo, B., Nonell, L., Salar, A., Fernandez-Rodriguez, C., Puigde-canet, E., Gimeno, J., Garcia-Garcia, M., Vela, M. C., Luno, E., Collado, R., Navarro, J. T.,de la Banda, E., Abrisqueta, P., Arenillas, L., Serrano, C., Lloreta, J., Minana, B., Cerutti,A., Florensa, L., Orfao, A., Sanz, F., Sole, F., Dominguez-Sola, D., and Serrano, S. (2014).Distinction between asymptomatic monoclonal b-cell lymphocytosis with cyclin d1 overexpres-sion and mantle cell lymphoma: from molecular profiling to flow cytometry. Clin Cancer Res,20(4):1007–19.

Fang, N. Y., Greiner, T. C., Weisenburger, D. D., Chan, W. C., Vose, J. M., Smith, L. M.,Armitage, J. O., Mayer, R. A., Pike, B. L., Collins, F. S., and Hacia, J. G. (2003). Oligonu-cleotide microarrays demonstrate the highest frequency of atm mutations in the mantle cellsubtype of lymphoma. Proc Natl Acad Sci U S A, 100(9):5372–7.

Fisher, S. G. and Fisher, R. I. (2004). The epidemiology of non-hodgkin’s lymphoma. Oncogene,23(38):6524–34.

Gardina, P. J., Lo, K. C., Lee, W., Cowell, J. K., and Turpaz, Y. (2008). Ploidy status and copynumber aberrations in primary glioblastomas defined by integrated analysis of allelic ratios,signal ratios and loss of heterozygosity using 500k snp mapping arrays. BMC Genomics, 9:489.

Genomes Project, C., Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M.,Korbel, J. O., Marchini, J. L., McCarthy, S., McVean, G. A., and Abecasis, G. R. (2015). Aglobal reference for human genetic variation. Nature, 526(7571):68–74.

Genovese, G., Kahler, A. K., Handsaker, R. E., Lindberg, J., Rose, S. A., Bakhoum, S. F.,Chambert, K., Mick, E., Neale, B. M., Fromer, M., Purcell, S. M., Svantesson, O., Landen,M., Hoglund, M., Lehmann, S., Gabriel, S. B., Moran, J. L., Lander, E. S., Sullivan, P. F.,Sklar, P., Gronberg, H., Hultman, C. M., and McCarroll, S. A. (2014). Clonal hematopoiesisand blood-cancer risk inferred from blood dna sequence. N Engl J Med, 371(26):2477–87.

Glenn, T. C. (2011). Field guide to next-generation dna sequencers. Mol Ecol Resour, 11(5):759–69.

Gondek, L. P., Haddad, A. S., O’Keefe, C. L., Tiu, R., Wlodarski, M. W., Sekeres, M. A., Theil,K. S., and Maciejewski, J. P. (2007). Detection of cryptic chromosomal lesions including

103

acquired segmental uniparental disomy in advanced and low-risk myelodysplastic syndromes.Exp Hematol, 35(11):1728–38.

Goodwin, S., McPherson, J. D., and McCombie, W. R. (2016). Coming of age: ten years ofnext-generation sequencing technologies. Nat Rev Genet, 17(6):333–51.

Greenwell, I. B., Staton, A. D., Lee, M. J., Switchenko, J. M., Saxe, D. F., Maly, J. J., Blum,K. A., Grover, N. S., Mathews, S. P., Gordon, M. J., Danilov, A. V., Epperla, N., Fenske,T. S., Hamadani, M., Park, S. I., Flowers, C. R., and Cohen, J. B. (2018). Complex karyotypein patients with mantle cell lymphoma predicts inferior survival and poor response to intensiveinduction therapy. Cancer, 124(11):2306–2315.

Greiner, T. C., Dasgupta, C., Ho, V. V., Weisenburger, D. D., Smith, L. M., Lynch, J. C., Vose,J. M., Fu, K., Armitage, J. O., Braziel, R. M., Campo, E., Delabie, J., Gascoyne, R. D., Jaffe,E. S., Muller-Hermelink, H. K., Ott, G., Rosenwald, A., Staudt, L. M., Im, M. Y., Karaman,M. W., Pike, B. L., Chan, W. C., and Hacia, J. G. (2006). Mutation and genomic deletionstatus of ataxia telangiectasia mutated (atm) and p53 confer specific gene expression profilesin mantle cell lymphoma. Proc Natl Acad Sci U S A, 103(7):2352–7.

Gupta, S. K., Viswanatha, D. S., and Patel, K. P. (2020). Evaluation of somatic hypermutationstatus in chronic lymphocytic leukemia (cll) in the era of next generation sequencing. FrontCell Dev Biol, 8:357.

Hadzidimitriou, A., Agathangelidis, A., Darzentas, N., Murray, F., Delfau-Larue, M. H., Ped-ersen, L. B., Lopez, A. N., Dagklis, A., Rombout, P., Beldjord, K., Kolstad, A., Dreyling,M. H., Anagnostopoulos, A., Tsaftaris, A., Mavragani-Tsipidou, P., Rosenwald, A., Ponzoni,M., Groenen, P., Ghia, P., Sander, B., Papadaki, T., Campo, E., Geisler, C., Rosenquist, R.,Davi, F., Pott, C., and Stamatopoulos, K. (2011). Is there a role for antigen selection in mantlecell lymphoma? immunogenetic support from a series of 807 cases. Blood, 118(11):3088–95.

Hamborg, K. H., Bentzen, H. H., Grubach, L., Hokland, P., and Nyvold, C. G. (2012). Ahighly sensitive and specific qpcr assay for quantification of the biomarker sox11 in mantle celllymphoma. Eur J Haematol, 89(5):385–94.

Hansen, M. C., Cédile, O., Ludvigsen, M., Kjeldsen, E., Møller, P. L., Abildgaard, N., Hokland,P., and Nyvold, C. G. (2019). Integrating detection of copy neutral chromosomal losses in aclinical setting in leukemia and lymphoma by means of allelic imbalance and read depth ratiocomparison. bioRxiv.

Hansen, M. C., Herborg, L. L., Hansen, M., Roug, A. S., and Hokland, P. (2016). Combination ofrna- and exome sequencing: Increasing specificity for identification of somatic point mutationsand indels in acute leukaemia. Leuk Res, 51:27–31.

Hansen, M. C., Nederby, L., Kjeldsen, E., Petersen, M. A., Ommen, H. B., and Hokland, P.(2018). Case report: Exome sequencing identifies t-all with myeloid features as a ikzf1-struckearly precursor t-cell malignancy. Leuk Res Rep, 9:1–4.

Hansen, M. C., Nederby, L., Roug, A., Villesen, P., Kjeldsen, E., Nyvold, C. G., and Hokland,P. (2015a). Novel scripts for improved annotation and selection of variants from whole exomesequencing in cancer research. MethodsX, 2:145–53.

Hansen, M. C., Nyvold, C. G., Roug, A. S., Kjeldsen, E., Villesen, P., Nederby, L., and Hokland,P. (2015b). Nature and nurture: a case of transcending haematological pre-malignancies in apair of monozygotic twins adding possible clues on the pathogenesis of b-cell proliferations.Br J Haematol, 169(3):391–400.

Hansen, S. V., Hansen, M. H., Cédile, O., Møller, M. B., Haaber, J., Abildgaard, N., andNyvold, C. G. (2021). Dissection of single b lymphocytes in mantle cell lymphoma suggestinga potential use for sox4. Unpublished.

Heinrich, V., Stange, J., Dickhaus, T., Imkeller, P., Kruger, U., Bauer, S., Mundlos, S., Robin-son, P. N., Hecht, J., and Krawitz, P. M. (2012). The allele distribution in next-generation

104

sequencing data sets is accurately described as the result of a stochastic branching process.Nucleic Acids Res, 40(6):2426–31.

Herrmann, A., Hoster, E., Zwingers, T., Brittinger, G., Engelhard, M., Meusers, P., Reiser,M., Forstpointner, R., Metzner, B., Peter, N., Wormann, B., Trumper, L., Pfreundschuh, M.,Einsele, H., Hiddemann, W., Unterhalt, M., and Dreyling, M. (2009). Improvement of overallsurvival in advanced stage mantle cell lymphoma. J Clin Oncol, 27(4):511–8.

Hieronymus, H., Murali, R., Tin, A., Yadav, K., Abida, W., Moller, H., Berney, D., Scher, H.,Carver, B., Scardino, P., Schultz, N., Taylor, B., Vickers, A., Cuzick, J., and Sawyers, C. L.(2018). Tumor copy number alteration burden is a pan-cancer prognostic factor associatedwith recurrence and death. Elife, 7.

Hill, H. A., Qi, X., Jain, P., Nomie, K., Wang, Y., Zhou, S., and Wang, M. L. (2020). Geneticmutations and features of mantle cell lymphoma: a systematic review and meta-analysis. BloodAdv, 4(13):2927–2938.

Hirata, M., Asano, N., Katayama, K., Yoshida, A., Tsuda, Y., Sekimizu, M., Mitani, S.,Kobayashi, E., Komiyama, M., Fujimoto, H., Goto, T., Iwamoto, Y., Naka, N., Iwata, S.,Nishida, Y., Hiruma, T., Hiraga, H., Kawano, H., Motoi, T., Oda, Y., Matsubara, D., Fujita,M., Shibata, T., Nakagawa, H., Nakayama, R., Kondo, T., Imoto, S., Miyano, S., Kawai, A.,Yamaguchi, R., Ichikawa, H., and Matsuda, K. (2019). Integrated exome and rna sequencingof dedifferentiated liposarcoma. Nat Commun, 10(1):5683.

Hirt, C., Schuler, F., Dolken, L., Schmidt, C. A., and Dolken, G. (2004). Low prevalence ofcirculating t(11;14)(q13;q32)-positive cells in the peripheral blood of healthy individuals asdetected by real-time quantitative pcr. Blood, 104(3):904–5.

Hoischen, A., van Bon, B. W., Gilissen, C., Arts, P., van Lier, B., Steehouwer, M., de Vries, P.,de Reuver, R., Wieskamp, N., Mortier, G., Devriendt, K., Amorim, M. Z., Revencu, N., Kidd,A., Barbosa, M., Turner, A., Smith, J., Oley, C., Henderson, A., Hayes, I. M., Thompson,E. M., Brunner, H. G., de Vries, B. B., and Veltman, J. A. (2010). De novo mutations ofsetbp1 cause schinzel-giedion syndrome. Nat Genet, 42(6):483–5.

Hokland, P., Cotter, F. E., and Hansen, M. C. (2016). Can exome scans be expected to bepart of real-time decision-making in patients with haematological cancers? Br J Haematol,174(3):486–92.

Hokland, P., Woll, P. S., Hansen, M. C., and Bill, M. (2019). The concept of leukaemic stem cellsin acute myeloid leukaemia 25 years on: hitting a moving target. Br J Haematol, 187(2):144–156.

Holst, J. M., Plesner, T. L., Pedersen, M. B., Frederiksen, H., Moller, M. B., Clausen, M. R.,Hansen, M. C., Hamilton-Dutoit, S. J., Norgaard, P., Johansen, P., Eberlein, T. R., Mortensen,B. K., Mathiasen, G., Ovlisen, A., Wang, R., Wang, C., Zhang, W., Ommen, H. B., Stentoft,J., Ludvigsen, M., Tam, W., Chan, W. C., Inghirami, G., and d’Amore, F. (2019). Myelopro-liferative and lymphoproliferative malignancies occurring in the same patient: a nationwidediscovery cohort. Haematologica.

Hutter, G., Scheubner, M., Zimmermann, Y., Kalla, J., Katzenberger, T., Hubler, K., Roth,S., Hiddemann, W., Ott, G., and Dreyling, M. (2006). Differential effect of epigenetic alter-ations and genomic deletions of cdk inhibitors [p16(ink4a), p15(ink4b), p14(arf)] in mantlecell lymphoma. Genes Chromosomes Cancer, 45(2):203–10.

Irizarry, R. A., Warren, D., Spencer, F., Kim, I. F., Biswal, S., Frank, B. C., Gabrielson, E.,Garcia, J. G., Geoghegan, J., Germino, G., Griffin, C., Hilmer, S. C., Hoffman, E., Jedlicka,A. E., Kawasaki, E., Martinez-Murillo, F., Morsberger, L., Lee, H., Petersen, D., Quackenbush,J., Scott, A., Wilson, M., Yang, Y., Ye, S. Q., and Yu, W. (2005). Multiple-laboratorycomparison of microarray platforms. Nat Methods, 2(5):345–50.

Jaiswal, S., Fontanillas, P., Flannick, J., Manning, A., Grauman, P. V., Mar, B. G., Lindsley,R. C., Mermel, C. H., Burtt, N., Chavez, A., Higgins, J. M., Moltchanov, V., Kuo, F. C., Kluk,

105

M. J., Henderson, B., Kinnunen, L., Koistinen, H. A., Ladenvall, C., Getz, G., Correa, A.,Banahan, B. F., Gabriel, S., Kathiresan, S., Stringham, H. M., McCarthy, M. I., Boehnke, M.,Tuomilehto, J., Haiman, C., Groop, L., Atzmon, G., Wilson, J. G., Neuberg, D., Altshuler, D.,and Ebert, B. L. (2014). Age-related clonal hematopoiesis associated with adverse outcomes.N Engl J Med, 371(26):2488–98.

Jaramillo Oquendo, C., Parker, H., Oscier, D., Ennis, S., Gibson, J., and Strefford, J. C.(2019). Systematic review of somatic mutations in splenic marginal zone lymphoma. SciRep, 9(1):10444.

Jares, P. and Campo, E. (2008). Advances in the understanding of mantle cell lymphoma. Br JHaematol, 142(2):149–65.

Ji, W., Qu, G. Z., Ye, P., Zhang, X. Y., Halabi, S., and Ehrlich, M. (1995). Frequent detectionof bcl-2/jh translocations in human blood and organ samples by a quantitative polymerasechain reaction assay. Cancer Res, 55(13):2876–82.

Kadalayil, L., Rafiq, S., Rose-Zerilli, M. J., Pengelly, R. J., Parker, H., Oscier, D., Strefford,J. C., Tapper, W. J., Gibson, J., Ennis, S., and Collins, A. (2015). Exome sequence readdepth methods for identifying copy number changes. Brief Bioinform, 16(3):380–92.

Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alfoldi, J., Wang, Q., Collins,R. L., Laricchia, K. M., Ganna, A., Birnbaum, D. P., Gauthier, L. D., Brand, H., Solomonson,M., Watts, N. A., Rhodes, D., Singer-Berk, M., England, E. M., Seaby, E. G., Kosmicki, J. A.,Walters, R. K., Tashman, K., Farjoun, Y., Banks, E., Poterba, T., Wang, A., Seed, C., Whiffin,N., Chong, J. X., Samocha, K. E., Pierce-Hoffman, E., Zappala, Z., O’Donnell-Luria, A. H.,Minikel, E. V., Weisburd, B., Lek, M., Ware, J. S., Vittal, C., Armean, I. M., Bergelson, L.,Cibulskis, K., Connolly, K. M., Covarrubias, M., Donnelly, S., Ferriera, S., Gabriel, S., Gentry,J., Gupta, N., Jeandet, T., Kaplan, D., Llanwarne, C., Munshi, R., Novod, S., Petrillo, N.,Roazen, D., Ruano-Rubio, V., Saltzman, A., Schleicher, M., Soto, J., Tibbetts, K., Tolonen,C., Wade, G., Talkowski, M. E., Genome Aggregation Database, C., Neale, B. M., Daly, M. J.,and MacArthur, D. G. (2020). The mutational constraint spectrum quantified from variationin 141,456 humans. Nature, 581(7809):434–443.

Karolchik, D., Hinrichs, A. S., Furey, T. S., Roskin, K. M., Sugnet, C. W., Haussler, D., andKent, W. J. (2004). The ucsc table browser data retrieval tool. Nucleic Acids Res, 32(Databaseissue):D493–6.

Kennedy, R. and Klein, U. (2018). Aberrant activation of nf-kappab signalling in aggressivelymphoid malignancies. Cells, 7(11).

Khodadoust, M. S., Olsson, N., Wagar, L. E., Haabeth, O. A., Chen, B., Swaminathan, K.,Rawson, K., Liu, C. L., Steiner, D., Lund, P., Rao, S., Zhang, L., Marceau, C., Stehr, H.,Newman, A. M., Czerwinski, D. K., Carlton, V. E., Moorhead, M., Faham, M., Kohrt, H. E.,Carette, J., Green, M. R., Davis, M. M., Levy, R., Elias, J. E., and Alizadeh, A. A. (2017).Antigen presentation profiling reveals recognition of lymphoma immunoglobulin neoantigens.Nature, 543(7647):723–727.

Khoury, J. D., Sen, F., Abruzzo, L. V., Hayes, K., Glassman, A., and Medeiros, L. J. (2003).Cytogenetic findings in blastoid mantle cell lymphoma. Hum Pathol, 34(10):1022–9.

Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., Miller, C. A.,Mardis, E. R., Ding, L., and Wilson, R. K. (2012). Varscan 2: somatic mutation and copynumber alteration discovery in cancer by exome sequencing. Genome Res, 22(3):568–76.

Kolar, G. R., Mehta, D., Pelayo, R., and Capra, J. D. (2007). A novel human b cell subpopulationrepresenting the initial germinal center population to express aid. Blood, 109(6):2545–52.

Kridel, R., Meissner, B., Rogic, S., Boyle, M., Telenius, A., Woolcock, B., Gunawardana, J.,Jenkins, C., Cochrane, C., Ben-Neriah, S., Tan, K., Morin, R. D., Opat, S., Sehn, L. H.,Connors, J. M., Marra, M. A., Weng, A. P., Steidl, C., and Gascoyne, R. D. (2012). Whole

106

transcriptome sequencing reveals recurrent notch1 mutations in mantle cell lymphoma. Blood,119(9):1963–71.

Kruglyak, L. and Nickerson, D. A. (2001). Variation is the spice of life. Nat Genet, 27(3):234–6.

Ku, C. S., Wu, M., Cooper, D. N., Naidoo, N., Pawitan, Y., Pang, B., Iacopetta, B., and Soong,R. (2012). Exome versus transcriptome sequencing in identifying coding region variants. ExpertRev Mol Diagn, 12(3):241–51.

Kumar, A., Sha, F., Toure, A., Dogan, A., Ni, A., Batlevi, C. L., Palomba, M. L. M., Portlock,C., Straus, D. J., Noy, A., Horwitz, S. M., Moskowitz, A., Hamlin, P., Moskowitz, C. H.,Matasar, M. J., Zelenetz, A. D., and Younes, A. (2019). Patterns of survival in patientswith recurrent mantle cell lymphoma in the modern era: progressive shortening in responseduration and survival after each relapse. Blood Cancer J, 9(6):50.

Lai, R., Lefresne, S. V., Franko, B., Hui, D., Mirza, I., Mansoor, A., Amin, H. M., and Ma,Y. (2006). Immunoglobulin vh somatic hypermutation in mantle cell lymphoma: mutatedgenotype correlates with better clinical outcome. Mod Pathol, 19(11):1498–505.

Lander, E. S. and Waterman, M. S. (1988). Genomic mapping by fingerprinting random clones:a mathematical analysis. Genomics, 2(3):231–9.

Lawrence, M. S., Stojanov, P., Polak, P., Kryukov, G. V., Cibulskis, K., Sivachenko, A., Carter,S. L., Stewart, C., Mermel, C. H., Roberts, S. A., Kiezun, A., Hammerman, P. S., McKenna,A., Drier, Y., Zou, L., Ramos, A. H., Pugh, T. J., Stransky, N., Helman, E., Kim, J., Sougnez,C., Ambrogio, L., Nickerson, E., Shefler, E., Cortes, M. L., Auclair, D., Saksena, G., Voet,D., Noble, M., DiCara, D., Lin, P., Lichtenstein, L., Heiman, D. I., Fennell, T., Imielinski,M., Hernandez, B., Hodis, E., Baca, S., Dulak, A. M., Lohr, J., Landau, D. A., Wu, C. J.,Melendez-Zajgla, J., Hidalgo-Miranda, A., Koren, A., McCarroll, S. A., Mora, J., Crompton,B., Onofrio, R., Parkin, M., Winckler, W., Ardlie, K., Gabriel, S. B., Roberts, C. W. M.,Biegel, J. A., Stegmaier, K., Bass, A. J., Garraway, L. A., Meyerson, M., Golub, T. R.,Gordenin, D. A., Sunyaev, S., Lander, E. S., and Getz, G. (2013). Mutational heterogeneityin cancer and the search for new cancer-associated genes. Nature, 499(7457):214–218.

Lecluse, Y., Lebailly, P., Roulland, S., Gac, A. C., Nadel, B., and Gauduchon, P. (2009). t(11;14)-positive clones can persist over a long period of time in the peripheral blood of healthy indi-viduals. Leukemia, 23(6):1190–3.

Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman,D., Baggerly, K., and Irizarry, R. A. (2010). Tackling the widespread and critical impact ofbatch effects in high-throughput data. Nat Rev Genet, 11(10):733–9.

Lelieveld, S. H., Spielmann, M., Mundlos, S., Veltman, J. A., and Gilissen, C. (2015). Comparisonof exome and genome sequencing technologies for the complete capture of protein-codingregions. Hum Mutat, 36(8):815–22.

Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., McLellan, M. D., Chen, K., Dooling, D., Dunford-Shore, B. H., McGrath, S., Hickenbotham, M., Cook, L., Abbott, R., Larson, D. E., Koboldt,D. C., Pohl, C., Smith, S., Hawkins, A., Abbott, S., Locke, D., Hillier, L. W., Miner, T., Fulton,L., Magrini, V., Wylie, T., Glasscock, J., Conyers, J., Sander, N., Shi, X., Osborne, J. R., Minx,P., Gordon, D., Chinwalla, A., Zhao, Y., Ries, R. E., Payton, J. E., Westervelt, P., Tomasson,M. H., Watson, M., Baty, J., Ivanovich, J., Heath, S., Shannon, W. D., Nagarajan, R., Walter,M. J., Link, D. C., Graubert, T. A., DiPersio, J. F., and Wilson, R. K. (2008). Dna sequencingof a cytogenetically normal acute myeloid leukaemia genome. Nature, 456(7218):66–72.

Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with burrows-wheelertransform. Bioinformatics, 25(14):1754–60.

Liu, Y., Zhou, J., and White, K. P. (2014). Rna-seq differential expression studies: more sequenceor more replication? Bioinformatics, 30(3):301–4.

107

Lovec, H., Grzeschiczek, A., Kowalski, M. B., and Moroy, T. (1994). Cyclin d1/bcl-1 cooperateswith myc genes in the generation of b-cell lymphoma in transgenic mice. EMBO J, 13(15):3487–95.

Luc, S., Luis, T. C., Boukarabila, H., Macaulay, I. C., Buza-Vidas, N., Bouriez-Jones, T., Lut-teropp, M., Woll, P. S., Loughran, S. J., Mead, A. J., Hultquist, A., Brown, J., Mizukami,T., Matsuoka, S., Ferry, H., Anderson, K., Duarte, S., Atkinson, D., Soneji, S., Domanski,A., Farley, A., Sanjuan-Pla, A., Carella, C., Patient, R., de Bruijn, M., Enver, T., Nerlov, C.,Blackburn, C., Godin, I., and Jacobsen, S. E. (2012). The earliest thymic t cell progenitorssustain b cell and myeloid lineage potential. Nat Immunol, 13(4):412–9.

Ma, X., Shao, Y., Tian, L., Flasch, D. A., Mulder, H. L., Edmonson, M. N., Liu, Y., Chen, X.,Newman, S., Nakitandwe, J., Li, Y., Li, B., Shen, S., Wang, Z., Shurtleff, S., Robison, L. L.,Levy, S., Easton, J., and Zhang, J. (2019). Analysis of error profiles in deep next-generationsequencing data. Genome Biol, 20(1):50.

Mardis, E. R. (2013). Next-generation sequencing platforms. Annu Rev Anal Chem (Palo AltoCalif), 6:287–303.

Mardis, E. R., Ding, L., Dooling, D. J., Larson, D. E., McLellan, M. D., Chen, K., Koboldt,D. C., Fulton, R. S., Delehaunty, K. D., McGrath, S. D., Fulton, L. A., Locke, D. P., Magrini,V. J., Abbott, R. M., Vickery, T. L., Reed, J. S., Robinson, J. S., Wylie, T., Smith, S. M.,Carmichael, L., Eldred, J. M., Harris, C. C., Walker, J., Peck, J. B., Du, F., Dukes, A. F.,Sanderson, G. E., Brummett, A. M., Clark, E., McMichael, J. F., Meyer, R. J., Schindler, J. K.,Pohl, C. S., Wallis, J. W., Shi, X., Lin, L., Schmidt, H., Tang, Y., Haipek, C., Wiechert, M. E.,Ivy, J. V., Kalicki, J., Elliott, G., Ries, R. E., Payton, J. E., Westervelt, P., Tomasson, M. H.,Watson, M. A., Baty, J., Heath, S., Shannon, W. D., Nagarajan, R., Link, D. C., Walter, M. J.,Graubert, T. A., DiPersio, J. F., Wilson, R. K., and Ley, T. J. (2009). Recurring mutationsfound by sequencing an acute myeloid leukemia genome. N Engl J Med, 361(11):1058–66.

Martin, P., Ghione, P., and Dreyling, M. (2017). Mantle cell lymphoma - current standards ofcare and future directions. Cancer Treat Rev, 58:51–60.

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella,K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M. A. (2010). The genome analysistoolkit: a mapreduce framework for analyzing next-generation dna sequencing data. GenomeRes, 20(9):1297–303.

Meissner, B., Kridel, R., Lim, R. S., Rogic, S., Tse, K., Scott, D. W., Moore, R., Mungall, A. J.,Marra, M. A., Connors, J. M., Steidl, C., and Gascoyne, R. D. (2013). The e3 ubiquitin ligaseubr5 is recurrently mutated in mantle cell lymphoma. Blood, 121(16):3161–4.

Meynert, A. M., Ansari, M., FitzPatrick, D. R., and Taylor, M. S. (2014). Variant detectionsensitivity and biases in whole genome and exome sequencing. BMC Bioinformatics, 15:247.

Meynert, A. M., Bicknell, L. S., Hurles, M. E., Jackson, A. P., and Taylor, M. S. (2013). Quanti-fying single nucleotide variant detection sensitivity in exome sequencing. BMC Bioinformatics,14:195.

Mozos, A., Royo, C., Hartmann, E., De Jong, D., Baro, C., Valera, A., Fu, K., Weisenburger,D. D., Delabie, J., Chuang, S. S., Jaffe, E. S., Ruiz-Marcellan, C., Dave, S., Rimsza, L., Braziel,R., Gascoyne, R. D., Sole, F., Lopez-Guillermo, A., Colomer, D., Staudt, L. M., Rosenwald,A., Ott, G., Jares, P., and Campo, E. (2009). Sox11 expression is highly specific for mantlecell lymphoma and identifies the cyclin d1-negative subtype. Haematologica, 94(11):1555–62.

Musunuru, K., Pirruccello, J. P., Do, R., Peloso, G. M., Guiducci, C., Sougnez, C., Garimella,K. V., Fisher, S., Abreu, J., Barry, A. J., Fennell, T., Banks, E., Ambrogio, L., Cibulskis, K.,Kernytsky, A., Gonzalez, E., Rudzicz, N., Engert, J. C., DePristo, M. A., Daly, M. J., Cohen,J. C., Hobbs, H. H., Altshuler, D., Schonfeld, G., Gabriel, S. B., Yue, P., and Kathiresan, S.(2010). Exome sequencing, angptl3 mutations, and familial combined hypolipidemia. N EnglJ Med, 363(23):2220–7.

108

Nagel, D., Vincendeau, M., Eitelhuber, A. C., and Krappmann, D. (2014). Mechanisms andconsequences of constitutive nf-kappab activation in b-cell lymphoid malignancies. Oncogene,33(50):5655–65.

Nam, J. Y., Kim, N. K., Kim, S. C., Joung, J. G., Xi, R., Lee, S., Park, P. J., and Park, W. Y.(2016). Evaluation of somatic copy number estimation tools for whole-exome sequencing data.Brief Bioinform, 17(2):185–92.

Navarro, A., Clot, G., Royo, C., Jares, P., Hadzidimitriou, A., Agathangelidis, A., Bikos, V.,Darzentas, N., Papadaki, T., Salaverria, I., Pinyol, M., Puig, X., Palomero, J., Vegliante,M. C., Amador, V., Martinez-Trillos, A., Stefancikova, L., Wiestner, A., Wilson, W., Pott,C., Calasanz, M. J., Trim, N., Erber, W., Sander, B., Ott, G., Rosenwald, A., Colomer, D.,Gine, E., Siebert, R., Lopez-Guillermo, A., Stamatopoulos, K., Bea, S., and Campo, E. (2012).Molecular subsets of mantle cell lymphoma defined by the ighv mutational status and sox11expression have distinct biologic and clinical features. Cancer Res, 72(20):5307–16.

Ng, S. B., Turner, E. H., Robertson, P. D., Flygare, S. D., Bigham, A. W., Lee, C., Shaffer, T.,Wong, M., Bhattacharjee, A., Eichler, E. E., Bamshad, M., Nickerson, D. A., and Shendure,J. (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature,461(7261):272–6.

Nielsen, P. I., Hansen, S. V., Moller, M. B., Kristensen, T. K., Cedile, O., Hansen, M. C.,Ebbesen, L. H., Haaber, J., Abildgaard, N., and Nyvold, C. G. (2019). Sensitive quantificationof the intronless sox11 mrna from lymph nodes biopsies in mantle cell lymphoma. Leuk Res,78:1–2.

Nothnagel, M., Herrmann, A., Wolf, A., Schreiber, S., Platzer, M., Siebert, R., Krawczak, M.,and Hampe, J. (2011). Technology-specific error signatures in the 1000 genomes project data.Hum Genet, 130(4):505–16.

Nowell, P. C. and Hungerford, D. A. (1960). Chromosome studies on normal and leukemic humanleukocytes. J Natl Cancer Inst, 25:85–109.

Obr, A., Prochazka, V., Jirkuvova, A., Urbankova, H., Kriegova, E., Schneiderova, P., Vatolikova,M., and Papajik, T. (2018). Tp53 mutation and complex karyotype portends a dismal prognosisin patients with mantle cell lymphoma. Clin Lymphoma Myeloma Leuk, 18(11):762–768.

O’Brien, T. D., Jia, P., Xia, J., Saxena, U., Jin, H., Vuong, H., Kim, P., Wang, Q., Aryee, M. J.,Mino-Kenudson, M., Engelman, J. A., Le, L. P., Iafrate, A. J., Heist, R. S., Pao, W., andZhao, Z. (2015). Inconsistency and features of single nucleotide variants detected in wholeexome sequencing versus transcriptome sequencing: A case study in lung cancer. Methods,83:118–27.

Onaindia, A., Medeiros, L. J., and Patel, K. P. (2017). Clinical utility of recently identifieddiagnostic, prognostic, and predictive molecular biomarkers in mature b-cell neoplasms. ModPathol, 30(10):1338–1366.

Parkin, B., Erba, H., Ouillette, P., Roulston, D., Purkayastha, A., Karp, J., Talpaz, M., Ku-jawski, L., Shakhan, S., Li, C., Shedden, K., and Malek, S. N. (2010). Acquired genomic copynumber aberrations and survival in adult acute myelogenous leukemia. Blood, 116(23):4958–67.

Parla, J. S., Iossifov, I., Grabill, I., Spector, M. S., Kramer, M., and McCombie, W. R. (2011).A comparative analysis of exome capture. Genome Biol, 12(9):R97.

Pathak, P., Li, Y., Gray, B. A., May, W. S., J., and Markham, M. J. (2017). Synchronousoccurrence of chronic myeloid leukemia and mantle cell lymphoma. Case Rep Hematol,2017:7815095.

Petersen, B. S., Fredrich, B., Hoeppner, M. P., Ellinghaus, D., and Franke, A. (2017). Opportu-nities and challenges of whole-genome and -exome sequencing. BMC Genet, 18(1):14.

109

Puente, X. S., Jares, P., and Campo, E. (2018). Chronic lymphocytic leukemia and mantle celllymphoma: crossroads of genetic and microenvironment interactions. Blood, 131(21):2283–2296.

Queiros, A. C., Beekman, R., Vilarrasa-Blasi, R., Duran-Ferrer, M., Clot, G., Merkel, A., Raineri,E., Russinol, N., Castellano, G., Bea, S., Navarro, A., Kulis, M., Verdaguer-Dot, N., Jares,P., Enjuanes, A., Calasanz, M. J., Bergmann, A., Vater, I., Salaverria, I., van de Werken, H.J. G., Wilson, W. H., Datta, A., Flicek, P., Royo, R., Martens, J., Gine, E., Lopez-Guillermo,A., Stunnenberg, H. G., Klapper, W., Pott, C., Heath, S., Gut, I. G., Siebert, R., Campo, E.,and Martin-Subero, J. I. (2016). Decoding the dna methylome of mantle cell lymphoma in thelight of the entire b cell lineage. Cancer Cell, 30(5):806–821.

Quinlan, A. R. and Hall, I. M. (2010). Bedtools: a flexible suite of utilities for comparing genomicfeatures. Bioinformatics, 26(6):841–2.

Randen, U., Yri, O. E., Tierens, A., Heim, S., Beiske, K., and Delabie, J. (2011). Mantle celllymphoma with features of marginal-zone lymphoma. J Hematop, 4(1):7–11.

Robasky, K., Lewis, N. E., and Church, G. M. (2014). The role of replicates for error mitigationin next-generation sequencing. Nat Rev Genet, 15(1):56–62.

Rodler, E., Welborn, J., Hatcher, S., Unger, K., Larkin, E., Gumerlock, P. H., Wun, T., andRichman, C. (2004). Blastic mantle cell lymphoma developing concurrently in a patient withchronic myelogenous leukemia and a review of the literature. Am J Hematol, 75(4):231–8.

Rokita, J. L., Rathi, K. S., Cardenas, M. F., Upton, K. A., Jayaseelan, J., Cross, K. L., Pfeil, J.,Egolf, L. E., Way, G. P., Farrel, A., Kendsersky, N. M., Patel, K., Gaonkar, K. S., Modi, A.,Berko, E. R., Lopez, G., Vaksman, Z., Mayoh, C., Nance, J., McCoy, K., Haber, M., Evans,K., McCalmont, H., Bendak, K., Bohm, J. W., Marshall, G. M., Tyrrell, V., Kalletla, K.,Braun, F. K., Qi, L., Du, Y., Zhang, H., Lindsay, H. B., Zhao, S., Shu, J., Baxter, P., Morton,C., Kurmashev, D., Zheng, S., Chen, Y., Bowen, J., Bryan, A. C., Leraas, K. M., Coppens,S. E., Doddapaneni, H., Momin, Z., Zhang, W., Sacks, G. I., Hart, L. S., Krytska, K., Mosse,Y. P., Gatto, G. J., Sanchez, Y., Greene, C. S., Diskin, S. J., Vaske, O. M., Haussler, D.,Gastier-Foster, J. M., Kolb, E. A., Gorlick, R., Li, X. N., Reynolds, C. P., Kurmasheva, R. T.,Houghton, P. J., Smith, M. A., Lock, R. B., Raman, P., Wheeler, D. A., and Maris, J. M.(2019). Genomic profiling of childhood tumor patient-derived xenograft models to enablerational clinical trial design. Cell Rep, 29(6):1675–1689 e9.

Rowley, J. D. (1973). Letter: A new consistent chromosomal abnormality in chronic myelogenousleukaemia identified by quinacrine fluorescence and giemsa staining. Nature, 243(5405):290–3.

Royo, C., Salaverria, I., Hartmann, E. M., Rosenwald, A., Campo, E., and Bea, S. (2011).The complex landscape of genetic alterations in mantle cell lymphoma. Semin Cancer Biol,21(5):322–34.

Rutledge, R. G. and Cote, C. (2003). Mathematics of quantitative kinetic pcr and the applicationof standard curves. Nucleic Acids Res, 31(16):e93.

Sachidanandam, R., Weissman, D., Schmidt, S. C., Kakol, J. M., Stein, L. D., Marth, G.,Sherry, S., Mullikin, J. C., Mortimore, B. J., Willey, D. L., Hunt, S. E., Cole, C. G., Coggill,P. C., Rice, C. M., Ning, Z., Rogers, J., Bentley, D. R., Kwok, P. Y., Mardis, E. R., Yeh,R. T., Schultz, B., Cook, L., Davenport, R., Dante, M., Fulton, L., Hillier, L., Waterston,R. H., McPherson, J. D., Gilman, B., Schaffner, S., Van Etten, W. J., Reich, D., Higgins,J., Daly, M. J., Blumenstiel, B., Baldwin, J., Stange-Thomann, N., Zody, M. C., Linton, L.,Lander, E. S., Altshuler, D., and International, S. N. P. M. W. G. (2001). A map of humangenome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature,409(6822):928–33.

Salaverria, I., Royo, C., Carvajal-Cuenca, A., Clot, G., Navarro, A., Valera, A., Song, J. Y.,Woroniecka, R., Rymkiewicz, G., Klapper, W., Hartmann, E. M., Sujobert, P., Wlodarska, I.,Ferry, J. A., Gaulard, P., Ott, G., Rosenwald, A., Lopez-Guillermo, A., Quintanilla-Martinez,

110

L., Harris, N. L., Jaffe, E. S., Siebert, R., Campo, E., and Bea, S. (2013). Ccnd2 rearrange-ments are the most frequent genetic events in cyclin d1(-) mantle cell lymphoma. Blood,121(8):1394–402.

Samorodnitsky, E., Jewell, B. M., Hagopian, R., Miya, J., Wing, M. R., Lyon, E., Damodaran, S.,Bhatt, D., Reeser, J. W., Datta, J., and Roychowdhury, S. (2015). Evaluation of hybridizationcapture versus amplicon-based methods for whole-exome sequencing. Hum Mutat, 36(9):903–14.

Sathirapongsasuti, J. F., Lee, H., Horst, B. A., Brunner, G., Cochran, A. J., Binder, S., Quack-enbush, J., and Nelson, S. F. (2011). Exome sequencing-based copy-number variation and lossof heterozygosity detection: Exomecnv. Bioinformatics, 27(19):2648–54.

Schaffner, C., Idler, I., Stilgenbauer, S., Dohner, H., and Lichter, P. (2000). Mantle cell lymphomais characterized by inactivation of the atm gene. Proc Natl Acad Sci U S A, 97(6):2773–8.

Scheinin, I., Sie, D., Bengtsson, H., van de Wiel, M. A., Olshen, A. B., van Thuijl, H. F., vanEssen, H. F., Eijk, P. P., Rustenburg, F., Meijer, G. A., Reijneveld, J. C., Wesseling, P.,Pinkel, D., Albertson, D. G., and Ylstra, B. (2014). Dna copy number analysis of fresh andformalin-fixed specimens by shallow whole-genome sequencing with identification and exclusionof problematic regions in the genome assembly. Genome Res, 24(12):2022–32.

Schmitt, C., Balogh, B., Grundt, A., Buchholtz, C., Leo, A., Benner, A., Hensel, M., Ho, A. D.,and Leo, E. (2006). The bcl-2/igh rearrangement in a population of 204 healthy individuals:occurrence, age and gender distribution, breakpoints, and detection method validity. LeukRes, 30(6):745–50.

Schraders, M., Oeschger, S., Kluin, P. M., Hebeda, K., Schuuring, E., Groenen, P. J., Hansmann,M. L., and van Krieken, J. H. (2009). Hypermutation in mantle cell lymphoma does notindicate a clinical or biological subentity. Mod Pathol, 22(3):416–25.

Shao, A. W., Sun, H., Geng, Y., Peng, Q., Wang, P., Chen, J., Xiong, T., Cao, R., and Tang,J. (2016). Bclaf1 is an important nf-kappab signaling transducer and c/ebpbeta regulator indna damage-induced senescence. Cell Death Differ, 23(5):865–75.

Shi, W., Ng, C. K. Y., Lim, R. S., Jiang, T., Kumar, S., Li, X., Wali, V. B., Piscuoglio, S.,Gerstein, M. B., Chagpar, A. B., Weigelt, B., Pusztai, L., Reis-Filho, J. S., and Hatzis, C.(2018). Reliability of whole-exome sequencing for assessing intratumor genetic heterogeneity.Cell Rep, 25(6):1446–1457.

Shi, Y. and Majewski, J. (2013). Fishingcnv: a graphical software package for detecting rarecopy number variations in exome-sequencing data. Bioinformatics, 29(11):1461–2.

Simonsen, A., Hansen, M., Kjeldsen, E., Moller, P. L., Hindkjaer, J. J., Hokland, P., and Ag-gerholm, A. (2018). Systematic evaluation of signal-to-noise ratio in variant detection fromsingle cell genome multiple displacement amplification and exome sequencing. BMC Genomics,19(1):681.

Simonsen, A. T., Bill, M., Rosenberg, C. A., Hansen, M. H., Moller, P. L., Kjeldsen, E., Johansen,K. D., Ommen, H. B., Nederby, L., Aggerholm, A., Hokland, P., and Ludvigsen, M. (2020).Unraveling clonal heterogeneity at the stem cell level in myelodysplastic syndrome: In pursuitof cell subsets driving disease progression. Leuk Res, 92:106350.

Simonsen, A. T., Schou, M., Sorensen, C. D., Bentzen, H. H., and Nyvold, C. G. (2015). Sox11,ccnd1, bcl1/igh and igh-vdj: a battle of minimal residual disease markers in mantle cell lym-phoma? Leuk Lymphoma, 56(9):2724–7.

Simonsen, A. T., Sorensen, C. D., Ebbesen, L. H., Bodker, J. S., Bentzen, H. H., and Nyvold,C. G. (2014). Sox11 as a minimal residual disease marker for mantle cell lymphoma. LeukRes, 38(8):918–24.

111

Slotta-Huspenina, J., Koch, I., de Leval, L., Keller, G., Klier, M., Bink, K., Kremer, M., Raffeld,M., Fend, F., and Quintanilla-Martinez, L. (2012). The impact of cyclin d1 mrna isoforms,morphology and p53 in mantle cell lymphoma: p53 alterations and blastoid morphology arestrong predictors of a high proliferation index. Haematologica, 97(9):1422–30.

Speedy, H. E., Di Bernardo, M. C., Sava, G. P., Dyer, M. J., Holroyd, A., Wang, Y., Sunter,N. J., Mansouri, L., Juliusson, G., Smedby, K. E., Roos, G., Jayne, S., Majid, A., Dearden, C.,Hall, A. G., Mainou-Fowler, T., Jackson, G. H., Summerfield, G., Harris, R. J., Pettitt, A. R.,Allsup, D. J., Bailey, J. R., Pratt, G., Pepper, C., Fegan, C., Rosenquist, R., Catovsky, D.,Allan, J. M., and Houlston, R. S. (2014). A genome-wide association study identifies multiplesusceptibility loci for chronic lymphocytic leukemia. Nat Genet, 46(1):56–60.

Steensma, D. P., Bejar, R., Jaiswal, S., Lindsley, R. C., Sekeres, M. A., Hasserjian, R. P., andEbert, B. L. (2015). Clonal hematopoiesis of indeterminate potential and its distinction frommyelodysplastic syndromes. Blood, 126(1):9–16.

Stitziel, N. O., Kiezun, A., and Sunyaev, S. (2011). Computational and statistical approaches toanalyzing variants identified by exome sequencing. Genome Biol, 12(9):227.

Sun, S. C. (2012). The noncanonical nf-kappab pathway. Immunol Rev, 246(1):125–40.

Sørensen, J. F., Aggerholm, A., Kerndrup, G. B., Hansen, M. C., Ewald, I. K. L., Bill, M.,Ebbesen, L. H., Rosenberg, C. A., Hokland, P., Ludvigsen, M., and Stidsholt Roug, A.(2020). Clonal hematopoiesis predicts development of therapy-related myeloid neoplasms post-autologous stem cell transplantation. Blood Adv, 4(5):885–892.

Takahashi, N., Miura, I., Saitoh, K., and Miura, A. B. (1998). Lineage involvement of stemcells bearing the philadelphia chromosome in chronic myeloid leukemia in the chronic phaseas shown by a combination of fluorescence-activated cell sorting and fluorescence in situ hy-bridization. Blood, 92(12):4758–63.

Tan, R., Wang, Y., Kleinstein, S. E., Liu, Y., Zhu, X., Guo, H., Jiang, Q., Allen, A. S., andZhu, M. (2014). An evaluation of copy number variation detection tools from whole-exomesequencing data. Hum Mutat, 35(7):899–907.

Thorselius, M., Walsh, S., Eriksson, I., Thunberg, U., Johnson, A., Backlin, C., Enblad, G.,Sundstrom, C., Roos, G., and Rosenquist, R. (2002). Somatic hypermutation and v(h) geneusage in mantle cell lymphoma. Eur J Haematol, 68(4):217–24.

Vaisvilas, M., Dirse, V., Aleksiuniene, B., Tamuliene, I., Cimbalistiene, L., Utkus, A., andRascon, J. (2018). Acute pre-b lymphoblastic leukemia and congenital anomalies in a childwith a de novo 22q11.1q11.22 duplication. Balkan J Med Genet, 21(1):87–91.

Van Dyke, D. L., Shanafelt, T. D., Call, T. G., Zent, C. S., Smoley, S. A., Rabe, K. G., Schwager,S. M., Sonbert, J. C., Slager, S. L., and Kay, N. E. (2010). A comprehensive evaluation of theprognostic significance of 13q deletions in patients with b-chronic lymphocytic leukaemia. BrJ Haematol, 148(4):544–50.

Walter, C., Pozzorini, C., Reinhardt, K., Geffers, R., Xu, Z., Reinhardt, D., von Neuhoff, N., andHanenberg, H. (2018). Single-cell whole exome and targeted sequencing in npm1/flt3 positivepediatric acute myeloid leukemia. Pediatr Blood Cancer, 65(2).

Wang, J., Raskin, L., Samuels, D. C., Shyr, Y., and Guo, Y. (2015). Genome measures used forquality control are dependent on gene function and ancestry. Bioinformatics, 31(3):318–23.

Wang, M., Munoz, J., Goy, A., Locke, F. L., Jacobson, C. A., Hill, B. T., Timmerman, J. M.,Holmes, H., Jaglowski, S., Flinn, I. W., McSweeney, P. A., Miklos, D. B., Pagel, J. M., Kersten,M. J., Milpied, N., Fung, H., Topp, M. S., Houot, R., Beitinjaneh, A., Peng, W., Zheng, L.,Rossi, J. M., Jain, R. K., Rao, A. V., and Reagan, P. M. (2020). Kte-x19 car t-cell therapy inrelapsed or refractory mantle-cell lymphoma. N Engl J Med, 382(14):1331–1342.

112

Wang, X., Asplund, A. C., Porwit, A., Flygare, J., Smith, C. I., Christensson, B., and Sander, B.(2008). The subcellular sox11 distribution pattern identifies subsets of mantle cell lymphoma:correlation to overall survival. Br J Haematol, 143(2):248–52.

Warr, A., Robert, C., Hume, D., Archibald, A., Deeb, N., and Watson, M. (2015). Exomesequencing: Current and future perspectives. G3 (Bethesda), 5(8):1543–50.

Wiszniewska, J., Bi, W., Shaw, C., Stankiewicz, P., Kang, S. H., Pursley, A. N., Lalani, S.,Hixson, P., Gambin, T., Tsai, C. H., Bock, H. G., Descartes, M., Probst, F. J., Scaglia, F.,Beaudet, A. L., Lupski, J. R., Eng, C., Cheung, S. W., Bacino, C., and Patel, A. (2014).Combined array cgh plus snp genome analyses in a single assay for optimized clinical testing.Eur J Hum Genet, 22(1):79–87.

Wlodarska, I., Dierickx, D., Vanhentenrijk, V., Van Roosbroeck, K., Pospisilova, H., Minnei, F.,Verhoef, G., Thomas, J., Vandenberghe, P., and De Wolf-Peeters, C. (2008). Translocationstargeting ccnd2, ccnd3, and mycn do occur in t(11;14)-negative mantle cell lymphomas. Blood,111(12):5683–90.

Wu, C., de Miranda, N. F., Chen, L., Wasik, A. M., Mansouri, L., Jurczak, W., Galazka, K.,Dlugosz-Danecka, M., Machaczka, M., Zhang, H., Peng, R., Morin, R. D., Rosenquist, R.,Sander, B., and Pan-Hammarstrom, Q. (2016). Genetic heterogeneity in primary and relapsedmantle cell lymphomas: Impact of recurrent card11 mutations. Oncotarget, 7(25):38180–38190.

Xu, W. and Li, J. Y. (2010). Sox11 expression in mantle cell lymphoma. Leuk Lymphoma,51(11):1962–7.

Xu-Monette, Z. Y., Li, J., Xia, Y., Crossley, B., Bremel, R. D., Miao, Y., Xiao, M., Snyder, T.,Manyam, G. C., Tan, X., Zhang, H., Visco, C., Tzankov, A., Dybkaer, K., Bhagat, G., Tam,W., You, H., Hsi, E. D., van Krieken, J. H., Huh, J., Ponzoni, M., Ferreri, A. J. M., Moller,M. B., Piris, M. A., Winter, J. N., Medeiros, J. T., Xu, B., Li, Y., Kirsch, I., and Young,K. H. (2019). Immunoglobulin somatic hypermutation has clinical impact in dlbcl and poten-tial implications for immune checkpoint blockade and neoantigen-based immunotherapies. JImmunother Cancer, 7(1):272.

Yang, P., Zhang, W., Wang, J., Liu, Y., An, R., and Jing, H. (2018). Genomic landscape andprognostic analysis of mantle cell lymphoma. Cancer Gene Ther, 25(5-6):129–140.

Yu, C., Arcos-Burgos, M., Baune, B. T., Arolt, V., Dannlowski, U., Wong, M. L., and Licinio,J. (2018). Low-frequency and rare variants may contribute to elucidate the genetics of majordepressive disorder. Transl Psychiatry, 8(1):70.

Zare, F., Dow, M., Monteleone, N., Hosny, A., and Nabavi, S. (2017). An evaluation of copynumber variation detection tools for cancer using whole exome sequencing data. BMC Bioin-formatics, 18(1):286.

Zhang, B., Calado, D. P., Wang, Z., Frohler, S., Kochert, K., Qian, Y., Koralov, S. B., Schmidt-Supprian, M., Sasaki, Y., Unitt, C., Rodig, S., Chen, W., Dalla-Favera, R., Alt, F. W.,Pasqualucci, L., and Rajewsky, K. (2015). An oncogenic role for alternative nf-kappab signalingin dlbcl revealed upon deregulated bcl6 expression. Cell Rep, 11(5):715–26.

Zhang, J., Jima, D., Moffitt, A. B., Liu, Q., Czader, M., Hsi, E. D., Fedoriw, Y., Dunphy, C. H.,Richards, K. L., Gill, J. I., Sun, Z., Love, C., Scotland, P., Lock, E., Levy, S., Hsu, D. S.,Dunson, D., and Dave, S. S. (2014). The genomic landscape of mantle cell lymphoma is relatedto the epigenetically determined chromatin state of normal b cells. Blood, 123(19):2988–96.

Ziller, M. J., Hansen, K. D., Meissner, A., and Aryee, M. J. (2015). Coverage recommendationsfor methylation analysis by whole-genome bisulfite sequencing. Nat Methods, 12(3):230–2, 1 pfollowing 232.

113

Part IV

Thesis Papers

114

Paper I

115

A decade with whole exome sequencing in haematology

Marcus C. Hansen,1 Torsten Haferlach2 and Charlotte G. Nyvold1

1Hematology Pathology Research Laboratory, Research Unit for Hematology and Research Unit for Pathology, Odense University

Hospital, University of Southern Denmark, Odense, Denmark and 2MLL Munich Leukemia Laboratory GmbH, Munich, Germany

Summary

The first decade of capture-based targeted whole exome

sequencing (WES) has now passed, while the sequencing

modality continues to find more widespread usage in clinical

research laboratories and still offers an unprecedented diagnos-

tic assay in terms of throughput, informational content and

running costs. Until quite recently, WES has been out of reach

for many clinicians and molecular biologists, and it still poses

issues or is met with some reluctance with regards to cost versus

benefit in terms of effective assay costs, hands-on laboratory

work and data analysis bottlenecks. Although WES is used more

than ever, it may also be argued that the usage is peaking and

that new implementations, or relevance in its current state, will

likely be leveling off during the following decade as the price on

whole genome sequencing continues to drop. In this review, we

focus on the past decade of targeted whole exome sequencing in

malignant hematology. We thematically revisit some of the sig-

nificant discoveries and niches that use next-generation

sequencing, and we outline what and how WES has contributed

to the field – from clonal hematopoiesis of the aging bone mar-

row to profiling malignancies down to the single cell.

Keywords: whole exome sequencing, whole genome sequenc-

ing, targeted panel sequencing, next generation sequencing,

sequencing anniversary, laboratory hematology.

The human exome, which holds the protein coding informa-

tion, constitutes one to two percent of the total genome.

Errors in these regions of the genome are by far the largest

contributors to Mendelian diseases (Chong et al, 2015) and

acquired clonal diseases, such as cancer. Whole exome

sequencing (WES) has been instrumental in elucidating the

latter in the field of hematology and the complex landscapes

of hematological malignancies.

Illustrative paraphrasing, as exemplified above, frequently

finds usage in papers relating to WES but does not necessarily

capture the contributions and shortcomings of this ground-

breaking invention. Without much ado, the 10-year anniver-

sary of capture-based targeted exome sequencing is now

passing, while the technique continues to find more wide-

spread usage in clinical research laboratories and still offers

an unprecedented diagnostic assay in terms of throughput,

informational content and running costs. Until quite recently,

exome sequencing has been out of reach for many clinicians

and molecular biologists, and it still poses issues or is met

with some reluctance with regards to cost versus benefit in

terms of effective assay costs, hands-on laboratory work and

data analysis bottlenecks. In some situations, targeted panel

or amplicon sequencing is more suitable, e.g., in focused clin-

ical tests. Although WES is now used more than ever, it may

also be argued that this sequencing modality is peaking and

that new implementations, or usage in its current state, will

likely be leveling off during the next decade.

In fact, exomic sequencing was not initiated in 2009, but

several years earlier. Apart from the strategy of narrowing

whole genome sequencing analyses to the coding parts, one

interesting approach was the daunting design of PCR primers

targeting coding sequences of 18 191 genes in the study of

the mutational landscape of colorectal and breast cancers

(Wood et al, 2007). The jargon was further elaborated by

emphasizing that candidate cancer driver mutations do not

just appear as recurrent gene “mountains” but also as gene

“hills” that are large in number. The latter whimsically

described the long tail of low-frequent somatic mutations,

which is nicely exemplified by chronic lymphocytic leukemia

(CLL) in hematology. Looking back on the first decade of

whole genome sequencing, Vogelstein and fellow researchers

pondered the genomic landscapes of common cancers and

the lessons learned from these studies (Vogelstein et al,

2013), with an estimated price of more than $100,000 per

sequenced genome. First, the number of somatic mutations

is somewhat representative of the specific cancer type, which

is a notion still used today. Second, the number of mutations

increases with age. Third, the estimated number of canonical

mutated driver genes was at the time surprisingly small, with

125 defined based on the evaluation of more than three

Correspondence: Marcus C. Hansen, Odense University Hospital,

Hematology Pathology Research Laboratory, Research Unit for

Hematology and Research Unit for Pathology, University of

Southern Denmark, Odense, Denmark.

E-mail: [email protected]

review

ª 2019 British Society for Haematology and John Wiley & Sons LtdBritish Journal of Haematology, 2020, 188, 367–382

First published online 10 October 2019doi: 10.1111/bjh.16249

116

thousand tumors and a two orders of magnitude larger num-

ber of somatic mutations. Certainly, the “completion” of the

genome in 2003 marks an important reference point in the

early onset of the genomic revolution in cancer research.

Naturally, sequencing of the human genome was a massive

research undertaking that led to a public draft of the human

genome by the International Human Genome Sequencing

Consortium (Lander et al, 2001), made freely available in

2001 and finalized in 2003. In parallel, Venter et al published

their version of the genome (Venter et al, 2001) and esti-

mated that exons span 1!1% of the genome. One of the

major differences between these studies was the initial esti-

mation of the number of genes – an issue which remains

unsettled.

In hindsight, exome sequencing has been of tremendous

value, taught us much about the genome, and has shed light

on the pathobiology of a wide range of diseases, especially

cancers. The coding genome contains roughly twenty thou-

sand single-nucleotide variants (Choi et al, 2009; Ng et al,

2009). Not only is this an inherent feature of the human

genome, it is also a practical handle for quality assessment,

as too high or low a number of coding single-nucleotide

polymorphisms (SNPs) in a sample of a given population

indicates a problem with sequencing, library or DNA sample.

The typical or expected number of mutations of a specific

cancer type and age constitutes another example of such an

attribute. The number of mutations generally increases with

age, and the combinations of these mutations occur in a

nonrandom fashion (Welch et al, 2012). If the number of

coding variants corresponds to the number of genes (Choi

et al, 2009; Ng et al, 2009), then it is highly interesting that,

on average, one variant can be expected per gene, which is

also supported by data submitted in relation to exome

sequencing of colorectal cancer (Goryca et al, 2018). Further-

more, very few nonsense mutations are generally found by

sequencing (Choi et al, 2009).

While exome sequencing has taken a relatively long time

to find its way to the clinic, and for many laboratories, per-

haps it will never enter the diagnostic scene; in parallel, many

research studies have included massive undertakings of col-

lectively gathering of samples from thousands of individuals

(Genomes Project Consortioum et al, 2015). The Exome

Aggregation Consortium has managed to combine more than

sixty thousand exomes (Lek et al, 2016) from 22 projects,

with 1000 Genomes and The Cancer Genome Atlas included.

The hyperauthorships of the last two decades in life sciences

are indicators of the sheer size of such projects and the num-

ber of people involved. This brings around another conse-

quential phenomenon: the publication process of large

sequencing studies is extensive and calls for new publication

strategies, such as manuscript preprints, which has also been

the case for the Genome Aggregation Database. At the time

of writing, this holds the data for 125,748 exomes and 15,708

genomes from the same number of individuals (Karczewski

et al, 2019).

In this review, we focus on a decade of targeted WES in

the perspective of malignant hematology. We thematically

revisit some of the significant discoveries and niches that use

next-generation sequencing (NGS), and we briefly elucidate

what WES contributed to and how. As an example, it is now

realized that many mutations are shared between the lin-

eages, and some malignancies share both characteristics of

the myeloid and lymphoid phenotypes. Moreover, we will

touch on how evidence is gathered on the single-cell level in

a time when prices of whole genome sequencing push the

limits for what can be achieved in research and in the clinic.

WES and its use must thus be put into the context of partic-

ular research, clinical problems and other sequencing meth-

ods to fully appreciate its contribution (Fig 1).

Portraying the exome

Before delving further into the role sequencing has played in

elucidating derailed hematopoiesis, we will briefly focus on

more technical considerations in a historical context, which

is relevant when evaluating the sequencing studies. Most

often, WES is compared to whole genome sequencing

(WGS) and targeted panel or amplicon sequencing. It is also

compared to chain-terminating dideoxy sequencing, led by

Sanger (Sanger et al, 1977), which is still in use in clinical

laboratories after more than four decades. Only a few years

ago, results from next-generation sequencing were expected

to be positively confirmed by Sanger sequencing, which often

had a poorer sensitivity. Whole exome sequencing is the tar-

geted sequencing of short reads (Fig 2). These two attributes

are important, as they influence quality and possibilities in

research and clinically applied analyses – and they represent

both its strength and Achilles heel.

The coupling of exome capture arrays to Illumina’s

sequencing platform in 2009 marked the hot start of WES

(Choi et al, 2009). Armed with a protocol for microarray

hybridization followed by sequencing by synthesis on the lat-

est NGS offshoot (Genome Analyzer II, Illumina, San Diego,

CA, USA), Ng and colleagues at the University of Washing-

ton (Seattle, WA, USA) managed to capture and sufficiently

sequence approximately 26 megabases (Mb) of the coding

genome, based on twelve individuals (Ng et al, 2009). This

figure is assessed relative to the approximately 30 Mb size of

the exome, corresponding to approximately 1% of the

human genome. It was evaluated that the detection of SNPs

in the coding genome had equivalent sensitivity to that of

WGS (Ng et al, 2009). Not long after, the arrays were

replaced by in-solution capture of the coding genome (Bain-

bridge et al, 2010), showing reproducibility between technical

replicates and a library preparation protocol similar to the

ones currently used. The protocol was based on the Nim-

bleGen system (Roche, Basel, Switzerland). While not the

first paper describing the hybridization in solution instead of

a microarray, a previously performed study only targeted a

smaller fraction of the complete exome (Gnirke et al, 2009).

Review

368 ª 2019 British Society for Haematology and John Wiley & Sons LtdBritish Journal of Haematology, 2020, 188, 367–382

117

In hematology, it soon became clear that even if the con-

trol-paired whole-genome resequencing of a cancer was feasi-

ble, in technical and economic terms, it was not necessarily

practical. Thus, the first implementations of NGS in hematol-

ogy can be characterized as pseudoexome sequencing. In

malignant hematology, the foremost usage of sequencing is to

detect somatic mutations, and to do so, the best conditions

are (i) evenly distributed reads throughout the regions of

interest, (ii) as high depth of coverage as possible of both the

malignant sample and the paired control sample, and (iii) as

low an error rate as possible to avoid false-positive mutations.

A random DNA library will, in theory, provide evenly dis-

tributed sequencing depth (Lander & Waterman, 1988). How-

ever, this is a coarse approximation in regard to exome

sequencing. Due to the exon target capture being based on

hybridization to oligonucleotide probes, the resulting reads

distribution for each exon will, at best, be bell-shaped, with

the highest number of reads in the center of the exon, leading

to differences in statistical power and downstream resolution.

The concern was, among other places, raised in PNAS in

2015 under the conclusive title that “Whole-genome sequencing

is more powerful than whole-exome sequencing for detecting

exome variants” (Belkadi et al, 2015). It also concluded that

WES is not reliable for the detection of copy number varia-

tion (CNV). Although such detection poses different chal-

lenges than relying on WGS, this is an oversimplification and

is influenced by several factors [see (Sathirapongsasuti et al,

2011; Fromer et al, 2012; Hansen et al, 2019) for examples].

While we will not go further into details on CNV detection, it

must be noted that the Broad Institute recently released a

stable pipeline and best practices documentation for germline

and somatic CNV calling (GATK 4.1, Broad Institute, Cam-

bridge, MA, USA). This pipeline relies on a provided panel of

normals, such as 40 exome sequenced samples without CNVs

to denoise the sample of interest.

Collectively, it is now commonly known that exome cap-

turing and PCR introduce bias (Parla et al, 2011; Braggio

et al, 2013; Meynert et al, 2014; Belkadi et al, 2015; Warr

et al, 2015). This affects one of the most important features

of WES, namely, variant allele frequencies. It is an important

point to raise, as sequencing projects often deduce clonal

architecture and burden from these data.

Implementations of exome sequencing inhematology

Several themes in hematology can be identified from the last

decade, including recurrent mutations, molecular stratifica-

tion of diagnoses, premalignant lesions and age-related clonal

hematopoiesis, allelic burden and detection of residual dis-

ease, intrapatient heterogeneity, and demonstration of sub-

clonal mutations. Early signs of implementing sequencing of

the coding genome in the field of hematology, and the

wealth of information to come, became evident with the first

whole cancer genome sequencing of a patient suffering from

acute myeloid leukemia (AML) with a normal karyotype

(Ley et al, 2008). This study preceded the year when capture-

based exome sequencing entered the scene and was followed

Whole exome array capture coupled with NGS

BRAF mutions

in HCL In-solution

exome captureand sequencing

55 DLCBLgenomesanalysed

29 MCL genomesor exomes

investigated WES of

paediatricALL

Discovery ofpremalignant clones

in AML by WES

Google cloud

launched

Amazon elastic compute cloud

launced

DNA sequencing in space (MiniON)

Clonal evolutionof CLL mapped

”$1000” genomecommercially

available between2015–2017

WES of 105 CLL cases

RNA sequencing in space

200920072006

2011 2017 2019Sequencing of AML genome

IlluminaacquiresSolexa

Genome of James D. Watson

sequenced for <$1 000 000?

Craig Venter’s genome $70 000 000? (Sanger sequencing)

“$300 000per genome“?

$100 000per genome?

Human Genome Project

draft(2001)

~$3 000 000 000in total?

Celeragenome

~$300 000 000in total?(2001)

Human Genome Project

Initiated (1990)

Kerry Mullis discovers PCR

(1983)

SangerSequencing

(1977)

Watson &Crick’s DNAstructure(1953)

Single cell sequencingis ”Method of the Year“

Targeted NGS of 944MDS patients

Persistentmutations

in 1/2 of 430 AML patientsin remission

$100 genome?

Early access onNanopore DNA

sequencing

Illumina HiSeq X Ten Sequencer:“$1 000 genome possible”?

Semiconductor sequencing of

exomes

genomes of 1 092 mapped

(1000G)

DRAGENplatform launched

Coding variation in 60 706 humans

(ExAC)

Aggregation of125 748 exomes15 708 genomes

(gnomAD)

Sequencing of 200 de novo AML

WES of 17 182 individuals

to investigateclonal hematopoiesis

2015

100 000G Projectlaunched (NHS)

2013Genome Analyzerlaunched

Genome Analyzer IIlaunched in 2008

~$48 000per genome?

(Illumina)

Fig 1. Timeline of postmillennial sequencing with selected milestones and curiosities. Whole exome sequencing only represents a fraction of thedevelopments and achievements in the genomic era. Notably, drafts of the human genome were published in 2001 accompanied by staggeringcosts. The Genome Analyzer II (Illumina) entered the market in 2008, the same year the Genome Analyzer was used to sequence the first cancergenome from a patient with AML and a year ahead of targeted whole exome sequencing. The price of genome sequencing has decreased 100- to1000-fold in the last decade, and we will likely experience the first commercially available $100 genomes in the next decade. Many innovationshave been left out due to space limitations.

Review

ª 2019 British Society for Haematology and John Wiley & Sons Ltd 369British Journal of Haematology, 2020, 188, 367–382

118

by another case of cytogenetically normal AML (Mardis et al,

2009). Here, the most important finding was that the initially

identified somatic mutation in IDH1, which substitutes argi-

nine at codon 132, could be observed in 16/188 additional

cases by amplicon sequencing. From a technical stance, the

correlation of sequenced variant allele frequencies, partly

from DNA and reverse-transcribed mRNA, was noteworthy.

The first sequencing of an AML patient already suggested

subclonal accumulations based on allele frequency, hereby

indicating that FLT3-ITD was acquired as a late mutation

(Ley et al, 2008). Many of the concepts used today have

existed since very early studies and have been further elabo-

rated and expanded.

The general adoption of NGS in hematology was perhaps

not quick to reach clinical laboratory diagnostics, but it was

certainly placed in an advantageous position for some of the

malignancies compared to other cancer forms. This is par-

tially because of the relative ease of sampling in leukemia

and leukemic phase lymphomas, established routine bone

marrow aspirations, etc. Furthermore, emergent research

involving NGS has rested on already established workflows

with regards to storage of DNA and RNA in biobanks, fre-

quent follow-up for remission, relapse evaluation and so on.

Clonal hematopoiesis of healthy and diseasedindividuals

While hematological malignancies arise from clonal hemato-

poiesis, one may argue that the line between cancer and non-

cancerous proliferation has perhaps become even more

diffuse with the information gathered from NGS studies. A

recent study in the New England Journal of Medicine con-

cluded that mutations persisted in approximately half of the

investigated AML cohort (221/430) at complete remission

(Jongen-Lavrencic et al, 2018). The persistence of clonal

mutations may be attributed to residual disease as well as

caused by age-related clonal hematopoiesis (Genovese et al,

2014; Jaiswal et al, 2014). The latter is reinforced by the

observation that lingering mutations in DNMT3A, TET2 or

ASXL1 do not confer a higher rate of AML relapse. The cur-

rent definition of clonal hematopoiesis of indeterminate

potential (CHIP) is generally defined as more than 2% vari-

ant allele frequency (Steensma et al, 2015) of driver genes

[see reviews (Sano et al, 2018; Steensma, 2018)] – a very low

threshold for WES.

It is now known that CHIP is a common phenomenon in

elderly people, occurring in 10% above 65 years of age

Exomecapture

ExtractDNA

Blood or bonemarrow sample

Adapter ligation

Sequencing

Amplify captured library

Alignment to reference,variant detection etc.

DNA

Dna shearing & size selection

Purified genomic dna

EXON EXONEXON

PCR

Obtain high quality dna(A) (B) (C)

Prepare library Amplify & sequence

Fig 2. Generalized workflow of whole exome sequencing. The first step is obtaining DNA of sufficient quality (A), as this affects all downstreamanalyses. Importantly, the results generated from formalin-fixed paraffin-embedded tissue, bulk DNA, sorted cells, etc., are not necessarily com-patible with the aim. Library preparation (B) is also a crucial step, where different workflows of shearing, size selection, adapter ligation and cap-ture will affect the final result. PCR amplification is used to enrich targeted DNA before sequencing (C). The researcher must be aware of thepotential bias introduced by amplification and be able to handle this. Furthermore, they must know that no two downstream algorithms will out-put the exact same sets of variants or even copy number variations.

Review

370 ª 2019 British Society for Haematology and John Wiley & Sons LtdBritish Journal of Haematology, 2020, 188, 367–382

119

without any manifestations of a blood disorder (Genovese

et al, 2014). To avoid bias from false-positive variants arising

from technical errors and germline variants, the putative

somatic mutations in the mentioned study were selected

from an intermediate allele fraction. The three most frequent

and apparently mutated genes were DNMT3A, ASXL1 and

TET2, which are most frequently seen in myeloid malignan-

cies. Interestingly, Jaiswal et al performed low allele fre-

quency screening of 17,182 exomes without paired normal

controls (Jaiswal et al, 2014) but instead used a panel of con-

trols and a focused cancer variant list from the COSMIC

database. From a technical point of view, false-positives are

expected to occur in a substantial proportion of the variant

calls when approaching a lower variant allele frequency

[VAF, threshold defined as 3!5% for single-nucleotide vari-

ants (Jaiswal et al, 2014)] – an indisputable problem in WES.

Back then, the applied somatic caller, MuTect, only pro-

cessed single-nucleotide variants, so small indels had to be

detected by other means. The third study, published the

same year, had a similar approach (Xie et al, 2014) based on

malignant and matched controls from The Cancer Genome

Atlas but divided the VAFs into two categories; 2–10% and

one above 10%, calling the mutational analysis unbiased due

to high coverage. However, evidence was gathered to show

that WES is not unbiased (Parla et al, 2011; Meynert et al,

2013; Meynert et al, 2014; Warr et al, 2015). Although push-

ing exome sequencing to its extremes is challenging, bioin-

formatically, the lesson learned in 2014 was clear: clonal

hematopoiesis is indisputably linked to the aging genome –whether confounding or directly caused by underlying patho-

logical conditions – and confers an increased risk of develop-

ing a hematological malignancy of approximately 1% per

year (Genovese et al, 2014; Jaiswal et al, 2014). The number

of included whole exome sequenced samples in these studies

was unprecedented in the field of hematology, with the col-

lective effort drawing data from more than thirty thousand

screenings.

WGS, WES and targeted panel sequencing have played a

large role in elucidating noncancerous clonal hematopoiesis –a phenomenon known long before NGS entered the laborato-

ries and occurring in both lymphoid and myeloid lineages.

The continuum of benign clonal expansions to the evolution

of a malignant phenotype by accumulating genomic aberra-

tions was readily comprehensible from clinicians’ point of

view; a direct genomic association between monoclonal B cell

lymphocytosis preceding CLL or monoclonal gammopathy

preceding multiple myeloma, etc., became clear. It seems fair

to say that no other malignancy in the genomic age has

taught hematologists more about cancer heterogeneity, and

the longitudinal increment of lesions behind clonal evolution

leading a frank malignancy, than CLL. We refer to the recent

and thorough review by Crassini et al on the molecular

pathogenesis of CLL (Crassini et al, 2019). This paper also

ponders on an important prognosticator where WES falls

short, namely immunoglobulin gene rearrangement.

Mapping susceptibility towards clonallymphoproliferative and myeloproliferativedisorders

Although it is impossible to cover all the nooks and crannies of

exome sequencing and malignant hematology during the last

decade, one research area on the borderline of malignant and

benign leukocytosis deserves to be mentioned: the genetic pre-

disposition towards leukemia. The same year as the first publi-

cations on exome sequencing entered the stage, a short paper

entitled “Familial Chronic Lymphocytic Leukemia: What Does it

Mean to Me?” aimed to raise awareness of the genetic suscepti-

bility and to guide practitioners (Slager & Kay, 2009).

Although overall low, familial CLL and the risk of developing

lymphoproliferative diseases had been known for many years

(Houlston et al, 2003), but the genetic etiology was yet

unknown. As a concurrent study had just disproved the associ-

ation between a suspected polymorphic variant in CXCR4

(Crowther-Swanepoel et al, 2009) based on large-scale Sanger

sequencing, the problem was posed for genome-wide associa-

tion studies (GWAS). Nevertheless, the first GWAS utilized

arrays and not NGS. Several impressive studies were con-

ducted, which identified risk loci for CLL, such as in BCL2.

One interesting finding was that several of the variants were

common in the population and were thus detectable by SNP

arrays (Crowther-Swanepoel et al, 2010; Slager et al, 2012;

Berndt et al, 2013). In addition, some of the loci were outside

coding regions and, thus, not suited for exome screening. In

contrast, the role of mutated POT1 as a frequent contributor

in the development of CLL was initially discovered by WES

(Quesada et al, 2011; Ramsay et al, 2013) and shortly thereafter

extended to also hold a susceptibility locus by GWAS (Speedy

et al, 2014). Because more recent studies focus on somatic vari-

ants, the initial genotyping of cases in relation to heritable risk

has been extremely important in the struggle to understand the

complex leukemogenesis of CLL seen from single cases [as

exemplified by Hansen et al (2015)]. As with CLL, multiple

myeloma is characterized by a preceding monoclonal lympho-

proliferative disorder, and for this entity, there are now indica-

tions of an inherited risk of developing myeloma through the

analyses of germline WES data and possibly disease-causing

alleles (Scales et al, 2017). Not only have lymphoproliferative

disorders been significant in mapping familial cases (Goldin

et al, 2013), but familial myeloproliferative neoplasms and

AML have also received attention in the last decade. Myelopro-

liferative neoplasms are genomically complex and extend

beyond JAK2, MPL and CALR mutations. Because outlining

hereditary predispositions becomes a lengthy treatise, we refer

to the previous publication in this journal on this specific topic

(Rumi & Cazzola, 2017). More generally, we refer to Houlston

et al for the extensive work on familial cases and predisposition

towards hematological malignancies.

CEBPA is a frequently mutated gene in AML but is also

known to be one of the genes implicated in familial cases

(Smith et al, 2004; Pabst et al, 2008). One caveat on this

Review

ª 2019 British Society for Haematology and John Wiley & Sons Ltd 371British Journal of Haematology, 2020, 188, 367–382

120

driver pertains to the challenges in targeted sequencing (Yan

et al, 2016), partly owing to its high guanine-cytosine con-

tent, which may be problematic in PCR amplification (Man-

nelli et al, 2017); therefore, the findings often requires

confirmation by other modalities until the sequencing quality

is adequate, as exemplified by Tawana et al (2015). One

important lesson from the past ten years is that not all genes

are covered equally well or unbiasedly in WES due to cap-

ture, amplification or sequence homology. Examples of the

latter are members of the Tet methylcytosine dioxygenase

family, e.g., TET2, or the most notable members of the Ras

GTPases, NRAS, KRAS and HRAS. Other important lessons

from WES studies have involved Down syndrome and the

predisposition towards both myeloid and lymphoid malig-

nancies (Nikolaev et al, 2013; Yoshida et al, 2013; Schwartz-

man et al, 2017; Vesely et al, 2017).

Sequencing branched into the myeloid lineage

Few cancer forms have received the same amount of atten-

tion as acute myeloid leukemia. Not only have recurrent

mutations in the initiation of leukemogenesis been elucidated

extensively but so have premalignant genomic lesions (Welch

et al, 2012), which now receive attention in the form of per-

sistent mutations in patients in remission, as demonstrated

by targeted sequencing.

Going back to the beginning of next-generation coding

sequencing in hematology, the focus on coding variants was

reasonable, as these variants were already picked as the top

tier when evaluating somatic mutations (Mardis et al, 2009),

even when whole genome sequencing was performed. NGS

already had a steady hold in hematology by the time the first

exome sequencing studies began to appear. This is perhaps

best exemplified by Mardis et al, who sequenced almost the

complete genome of a patient suffering from AML with min-

imal maturation and succeeded in showing that the muta-

tions found also occurred recurrently in a cohort of no fewer

than 188 patients. One such mutation was in the IDH1 gene,

ubiquitously found in myeloid malignancies, along with pre-

viously identified mutations in NRAS and NPM1. This

spurred the hunt for new aberrations by other groups and

their mutational frequency in AML. The restriction of down-

stream analyses to coding regions was also the elaborate

strategy used the previous year, in 2008, when the first whole

cancer genome was sequenced (Ley et al, 2008).

Collectively, sequencing of AML has been an excellent

example of a joint scientific effort. The new possibilities aris-

ing from sequencing led to bursts of activity in characterizing

new genes involved in leukemogenesis across multiple diag-

noses, such as SRSF2 or SF3B1 (Yoshida et al, 2011; Meggen-

dorfer et al, 2012). Shortly after proving the involvement of

SF3B1 in myelodysplastic syndromes (MDS) with increased

ring sideroblasts (Yoshida et al, 2011), it was shown to con-

fer reduced cytopenia and a good prognosis in MDS

(Papaemmanuil et al, 2011). It was also found to be

recurrent in CLL as well but was, in contrast, associated with

poor overall survival (Quesada et al, 2011).

The blood-forming stem cells in the bone marrow accu-

mulate genomic lesions, leading to senescence or even can-

cer. As these cells proliferate to replenish the different cell

types constituting the blood and immune system, with an

estimated frequency of once every 25–50 weeks (Catlin

et al, 2011), the early mutations propagate along the matu-

ration of the lineages. The fact that the hematopoietic stem

cells (HSCs) of the bone marrow display an increasing

number of somatic mutations as a result of aging is now

indisputable. Equipped with exomes of hematopoietic stem/

progenitor cells from individuals in different age groups,

Welch and colleagues were able to assess the underlying

kinetics with a mutation rate of 0!13 coding mutations per

year (Welch et al, 2012) – sensible in the view of the fol-

lowing massive sequencing studies and a practical mental

cue for the observed median number of 13 coding muta-

tions in de novo AML (Cancer Genome Atlas Research

Network et al, 2013).

The emerging use of exome sequencing could have fore-

casted the demise of panel sequencing. This, however, was

not the case, and indeed, one may argue that recent years

have not set the stage for WES but that of targeted sequenc-

ing of driver genes in myeloid or lymphoid malignancies.

Much work has focused on fortifying the earlier results and

elucidating low-frequency variants in residual disease. While

the number of driver genes directly involved in the leukemo-

genesis is relatively small and is thus a comprehensible cata-

logue for the persistent hematologist, the possible

combinations – including positions and types of aberrations

– are staggering (Cancer Genome Atlas Research Network

et al, 2013; Papaemmanuil et al, 2016). It is now known that

such signatures identify molecular subgroups. Some muta-

tions are found to be narrowly defined hotspots, such as in

IDH1 or IDH2, while others are scattered over a large area,

such as TET2 mutations. The latter observation justifying the

wish for comprehensive sequencing as predefined panels will

only cover a fraction of the need. After a period hallmarked

by productivity but before WES became mainstream, many

researchers and clinical laboratories turned to panel sequenc-

ing. The seemingly little new knowledge to be gained by

exome sequencing and the difficulties in comprehending the

broad data output for the clinicians emphasized the useful-

ness of offering panel sequencing for diagnostic purposes

(Roug et al, 2014). The focused panels enabled much deeper

sequencing, and the overlapping genomic aberrations across

disease entities in the lymphoid or myeloid lineages and pre-

vious studies made it possible to characterize the genomic

profiles and classify patients into molecular or prognostic

subgroups, as has been done for MDS (Haferlach et al,

2014). In this particular study, investigating the landscape of

lesions in 944 patients with a panel of 104 genes, a whole

exome approach would have provided poor subclonal resolu-

tion and may not have provided much more information for

Review

372 ª 2019 British Society for Haematology and John Wiley & Sons LtdBritish Journal of Haematology, 2020, 188, 367–382

121

the specific task based on preexisting extensively character-

ized drivers (Papaemmanuil et al, 2013).

Even though NGS was thought to clarify the molecular

characterization, making it easier to classify the different

diagnoses within malignant hematology, which to some

extent it has, this not turned out to be a straightforward task.

Consequently, heterogeneity or clonal heterogeneity has been

one of the buzzwords of the last decade. Even a readily iden-

tifiable form of leukemia, such as acute promyelocytic leuke-

mia (APL), presented itself as a heterogeneous disease

(Ibanez et al, 2016). Not only has WES differentiated

patient-specific sets of mutations with known alterations

from other leukemias or lymphomas, but microRNA is also

altered in APL (Cancer Genome Atlas Research Network

et al, 2013). Therefore, exome sequencing only refers to a

small part of the aberrant genome in holistic characterization

of the hematological neoplasm. This has been unquestionably

notable in the lymphoid diseases, which are marked by a

high degree of molecular heterogeneity.

Some major contributions in genomic profilingof the lymphoid malignancies

While the first cancer genome sequenced was from a case of

AML and the myeloid neoplasms have received much atten-

tion in the last decade, no lymphoma has undergone the

same scrutiny as the diffuse large B cell lymphomas (DLBCL,

Fig 3). Although several recurrent mutations were known

before the advent of exome sequencing, its genomically

heterogeneous nature has required large analysis cohorts,

such as those in the recent paper published in the New Eng-

land Journal of Medicine last year, where mutational and

expressional profiles were coupled to identify the subgroups

and differentiation of DLBCL (Schmitz et al, 2018). When

considering the expected number of somatic mutations in

DLBCL, which is somewhat higher than in AML and other

hematological malignancies (Lohr et al, 2012; Zhang et al,

2013; Lawrence et al, 2014; Tokheim et al, 2016), it is a suit-

able candidate for machine learning techniques in computer-

aided profiling and diagnostics. In contrast, the powerful

approach by Reddy et al, portraying the landscape of DLBCL

in a cohort 1,001 patients by WES and RNAseq with func-

tional genomics (Reddy et al, 2017), showed just close to 8

mutations per case and 150 putative driver genes, collectively.

Interestingly, the paper also concluded that CRISPR-Cas9

knockout of several therapeutically targetable genes, such as

NOTCH2, did not alter the growth of DLBCL. This may

indicate that driver redundancy makes precision medicine a

tricky endeavor in some cases. Another quite recent study

also showed that the mutational profiles from NGS can be

implemented as a predictor for subtype classification (Sch-

mitz et al, 2018). The clear advantage of such an approach is

that such mutational predictors used in clinical settings may

be less sensitive to fluctuations in expression patterns due to

sample processing, laboratory batch effects and

computational analyses between laboratories when compared

to RNA expression profiles. More rare lymphomas, such as

mantle cell lymphoma, show that there is still a need for sys-

tematic molecular characterization, where NGS will continue

to play a major role.

Multiple myeloma (MM), which is caused by cancerous

mature plasma cells, is a very heterogeneous disease. Part of

this heterogeneity may be directly attributed to differences in

sampling (Sanoja-Flores et al, 2018) and cellular impurities.

Thus, the malignant cells or plasma cells with monoclonal

gammopathy of undetermined significance must be consis-

tently positively selected, e.g., CD138+, from peripheral

blood (Lohr et al, 2016) or bone marrow. While it is

acknowledged that a large part of the driving somatic alter-

ations either occurs in or directly involves the coding regions

of the genome, it is also generally accepted that sequencing

focusing on these parts can only reveal the molecular onco-

genesis to a limited extent. This realization is emphasized by

the first large-scale study based on WGS and WES, which

identified 35 nonsynonymous somatic point mutations on

average, which is in stark contrast to the estimated number

of genome-wide mutations of in the thousands (Chapman

et al, 2011). Multiple myeloma is noted to be the second

most common hematological cancer form in adults relative

to Non-Hodgkin’s lymphoma as a single entity; thus, it is

important to elucidate the aberrant molecular pathways of

this plasma cell disease. Apart from the conclusion that the

underlying mutational patterns of myeloma harboring t(4;14)

and t(11;14) are distinct (Walker et al, 2012), a previous

study also shed light on the clonal diversity and kinetics

based on the results of exome sequencing and single-cell

genotyping. Importantly, the variant allele frequencies of

WES were in striking agreement with the fractions observed

from the single cell, where the ATM kinase was mutated in

all three subclones, probably indicating an early step in onco-

genesis. Multiple myeloma displays few highly recurrent

genes (Bolli et al, 2014), such as KRAS, NRAS and BRAF, in

one-third of the patients, and mutations in these genes are

considered to be mutually exclusive (Walker et al, 2012).

Notably, some discrepancies between exome and whole gen-

ome data were identified in the aforementioned study.

WES has not only provided a tool for performing large

scale screening, but it has also often played an indirect role

in the insight into the genomic landmarks of individual

hematological neoplastic entities. One such example is a

diagnostically robust marker, the BRAF mutation (p.V600E),

which is known from other cancer forms such as melanoma.

A single case of hairy cell leukemia (HCL) sequenced along

with paired mononuclear control cells was enough to initiate

the investigation of mutational frequencies in HCL. This was

also a time when NGS was validated by Sanger sequencing;

thus, the findings were confirmed and fortified by the addi-

tional observation in 46/46 other HCL patients (Tiacci et al,

2011). Exome sequencing has also had a tremendous impact

on the molecular understanding of acute lymphoblastic

Review

ª 2019 British Society for Haematology and John Wiley & Sons Ltd 373British Journal of Haematology, 2020, 188, 367–382

122

leukemia (ALL), which is difficult to capture in a brief

review. The recurrent mutations of both B- and T-cell ALL

bear markers, which otherwise are indicative of a myeloid

disorder (Zhang et al, 2012; Ding et al, 2017; Liu et al,

2017), such as FLT3, along with classical lymphocytic path-

way members. It is suggested that FLT3 mutations point to

an immature pluripotent prothymocyte and that such

patients with stem cell-like leukemia may benefit from tyro-

sine kinase inhibitors (Neumann et al, 2013a), which may

improve their prognosis (Zhang et al, 2012).

The degree of complexity has definitely not been reduced

with the screening of genomes and exomes. However, while

the myeloid and lymphoid branches have been clearly

divided in the classical model of hematopoiesis and perme-

ated the clinical view, it is now recognized that certain

lesions in immature cells with some degree of pluripotency

are capable of giving rise to mixed lineages, biphenotypic

leukemias or malignancies stuck in early precursor differenti-

ation. Early T-cell precursor acute lymphoblastic leukemia

(ETP-ALL) is a molecular manifestation of such a potential.

If one gene were to capture the ambiguous nature of the

neoplasms derived from B- and T-lineage lymphoid precur-

sors, the Ikaros transcription factor IKZF1 is a strong candi-

date for such a gene, as it is a master regulator of lymphocyte

differentiation. Although primarily recognized for its role in

B-ALL (Marke et al, 2018), growing evidence supports that

IKZF1 plays a cardinal role in ETP-ALL (Zhang et al, 2012;

Hansen et al, 2018). Additionally, germline variants have

been investigated in the predisposition of developing child-

hood acute lymphoblastic leukemia (Churchman et al, 2018).

Other mutational markers of ETP-ALL have clear-cut

overlaps with the myeloid leukemias, such as DNMT3A

(Neumann et al, 2013b), FLT3 or mutations in the JAK-

STAT pathway.

One latecomer in sequencing is classical Hodgkin’s lym-

phoma (cHL) (Rotunno et al, 2016; Mata et al, 2017). The

molecular pathogenesis is still not fully charted, which may

partially have been influenced by improved prognosis. This

may have also been affected by the historical ease of diagno-

sis from the view of a hematopathologist: those which are

Hodgkin’s, and the others which are not – a classification dif-

ferent from that used in genomics. One specific challenge in

genomic characterization of cHL is the small fraction of

malignant cells in the bulk tumor mass. Because WES is not

directly geared for detection of 1% malignant cells, the

methodology calls for cell sorting. By using an optimized

exome library preparation Reichel et al could perform

sequencing on 10 nanograms of Hodgkin and Reed-Sternberg

cell DNA and thereby obviate the need for whole genome

amplification (Reichel et al, 2015).

Profiling rare subclones and low-frequencymalignant cells down to the single-cell level

One of the advances of the last decade has been single-cell

analysis, which is – at least in theory – well-suited for the

characterization of stem cells. It was realized that using WES

for the identification of mutational fingerprints was an excel-

lent base for the development of techniques with a high

enough resolution for the evaluation of low burden malig-

nancy, such as the quantification of minimal or measurable

residual diseases in patients with clinical remission. While

0 50 100 150 200

Leukemia (2011)Leukemia, myeloid (2011)

Lymphoma (2011)Lymphoma, Non-hodgkin (2011)

Leukemia, Lymphoid (2011)Leukemia, myeloid, acute (2011)

Lymphoma, B-Cell (2011)Lymphoma, Large B-Cell, diffuse (2012)

Leukemia, Lymphocytic, Chronic, B-Cell (2011)Precursor cell lymphoblastic Leukemia-lymphoma (2015)

Lymphoma, T-Cell (2012)Lymphoma, follicular (2011)

Leukemia, myelogenous, chronic, BCR-ABL positive (2013)Lymphoma, Mantle-Cell (2013)

Lymphoma, B-Cell, Marginal zone (2014)Leukemia, T-Cell (2012)

Leukemia, Biphenotypic, acute (2016)

MeshTerms Number of papers (PubMed)

Fig 3. The number of published papers in the PubMed database covering the selected MeSH Term subjects and containing the term exomesequencing (May 2019). The graph shows the relative abundance and thus focuses on a given hematological malignancy. The first occurrence inthe database is shown in parentheses. For comparison, capture-based exome sequencing was published in 2009, and the first cancer genome paperwas published in 2008.

Review

374 ª 2019 British Society for Haematology and John Wiley & Sons LtdBritish Journal of Haematology, 2020, 188, 367–382

123

WES is suited for the identification of mutation in driver

genes, once identified, focused deep sequencing in combina-

tion with cell sorting may be implemented to pinpoint rare

malignant cells (Malmberg et al, 2017). Despite major efforts,

few concepts in hematology are as pervasive and yet so

vaguely defined as the hematopoietic stem cell. However, the

very concept of stemness is extended to the self-renewal and

differentiation capacities of the leukemias (Jordan, 2007).

Thus, the single-cell analyses fit the proposal that the transi-

tion of the HSC to frank leukemia is defined by preleukemic

lesions of the stem cell. An accumulation of somatic muta-

tions and clonal progression of de novo AML were analyzed

even before the genomic landscape of such cases was mapped

(Jan et al, 2012). Exome sequencing became highly relevant

in the search for single leukemic cells, as the genomes of

these malignant cells, which, in contrast to normal cells, are

capable of self-renewal, are marked by lesions that are often

in the coding regions of the genome (Jan et al, 2012). The

subject is only briefly touched upon, as these efforts more

often belong to transcriptome sequencing (Yashiro et al,

2009; Lai et al, 2018), which has also been utilized in profil-

ing clonal architecture in childhood ALL by single-cell

sequencing (SCS) (Gawad et al, 2014).

Foremost, the sequencing studies have been used to delin-

eate the occurrence, prevalence and successions of genomic

lesions in hematological malignancies in the context of mye-

loid and lymphoid lineages. Together with the maturation of

single-cell analyses, it has, in principle, become possible to

characterize the shared hematopoietic stem cell (Baron et al,

2018) by parallel sequencing, thereby coming closer to por-

traying the immature cells and early progenitors. Neverthe-

less, researchers in the field of hematology have been puzzled

by the many occurrences of concurrent or successive myeloid

and lymphoid malignancies, which otherwise have been

marked by a wide cleft between the lineages in earlier

research. To a large extent, this has been driven by sequenc-

ing technologies developed in the last two decades, whereas

the usage of targeted exome sequencing is confined to the

last 10 years.

Next-generation sequencing has helped define and shape

the field of single-cell research, and some of the most impor-

tant publications using SCS have focused on specific prob-

lems in hematology. As early as 2012, a study identified

mutations in both leukemic cells and preleukemic

hematopoietic stem cells based on the results from an initial

exome screening of sorted cells and paired CD3+ control

(Jan et al, 2012). Instead of whole genome amplification,

which is largely the current method in SCS, the cells were

sorted into plates with subsequent culturing and genotyping.

By the time the first report on whole AML genome sequenc-

ing was published, another paper on advances and patents in

NGS briefly stressed the importance of developing single-cell

sequencing technologies and described the potential amplifi-

cation of a whole genome by multiple displacement amplifi-

cation (Lin et al, 2008). The low signal-to-noise and allelic

drop-outs still pose a challenge in SCS studies (Gawad et al,

2016; Alves & Posada, 2018; Simonsen et al, 2018), and cap-

ture-based targeted sequencing may suffer even more than

whole genome or RNA sequencing due to its additional steps

in sample processing. As single-cell sequencing evidently has

several issues to tackle, such as biased capture, partial or

complete drop out of targeted regions, and base substitu-

tions, transcriptome sequencing has become a popular alter-

native. Whereas transfer of information from the two copies

of DNA in a single cell is a delicate process, sequencing a

wealth of transcripts is statistically attractive.

Whereas a large number of patients were included in some

of the first WES studies, the current studies may also focus

on a few individuals with a large throughput of single cells

with state-of-the-art microfluidics (Pellegrino et al, 2018) or

sorting with subsequent whole-genome amplification (Walter

et al, 2018).

Discussion

Next-generation sequencing has helped cancer researchers to

understand how clones evolve and that subclones often exist

side by side. Solid tumors are found at the most extreme

end, with intratumoral heterogeneity reflected by different

sets of spatially confined somatic mutations (Gerlinger et al,

2012). In the timespan between the 2008 and the 2016 World

Health Organization (WHO) classification of hematopoietic

and lymphoid tumors, a riveting transition took place: the

field of hematology entered the genomic era.

Whole exome sequencing, as a technology, has con-

tributed immensely to the field of hematology in the last

10 years, and many of the pivotal findings were presented

within the first five years of its existence. These concepts

matured in the following years, and themes changed to

highly specific niches, such as single-cell sequencing, the

field of clonal hematopoiesis of indeterminate potential and

detection of minimal residual disease, to mention a few.

However, it is highly plausible that WGS will gradually, and

perhaps completely, replace WES during the next decade.

This is already occurring, and there are several factors influ-

encing this development. First, it may be argued that exome

sequencing is not inexpensive. In addition to proprietary

kits, the protocols require additional manual labor.

Although the costs involved in capture-based exome

sequencing have also decreased in recent years, WGS has

gained the largest advantage. In 2008, the details of the

genome of James D. Watson, co-discoverer of the DNA

double helix structure, were published (Wheeler et al,

2008). Sequenced the previous year and completed in two

months, the estimated price was less than $1 million – a

price tag that has since decreased a hundred- to a thou-

sand-fold. Effectively, WGS is catching up to its smaller

cousin, now that the $1,000 genome with 30x coverage is

readily available. Second, technical bias is introduced

through exon capture and target amplification by

Review

ª 2019 British Society for Haematology and John Wiley & Sons Ltd 375British Journal of Haematology, 2020, 188, 367–382

124

polymerase chain reaction (PCR), which is also an issue

with panel sequencing, but not in WGS protocols. Simply

increasing the sequencing depth for a sample, which is

highly feasible with the current low cost per sequenced

base, does not alone alleviate the problems of false-positive

variant detection. As sequencing offers a tool for the detec-

tion of low-frequency subclones or measurable residual dis-

ease detection and characterization in otherwise complete

remission, reducing technical bias, sample cross-talk or

noise is highly relevant for its clinical applicability. Adding

unique molecular identifiers (UMIs) to the DNA library will

help to determine and map the extent of PCR bias in a

sequenced sample (Best et al, 2015; MacConaill et al, 2018),

but this extension is still not commonly adopted in library

preparation or downstream bioinformatics. As UMIs have

been used and studied extensively in RNA and single-cell

sequencing (Sena et al, 2018), this will likely become main-

stream in cancer diagnostics for increased precision in tar-

geted and whole genome sequencing. The general consensus

that whole genome sequencing is superior to exome

sequencing in terms of quality is currently forming (Wang

et al, 2017), but WGS still suffers from a shallow sequenc-

ing depth, which is typically one-third that of WES or even

lower. In comparison, a site-specific depth of panel or

amplicon sequencing may be two to three orders of magni-

tude higher, and consequently, observations of a few hun-

dred are often discarded to minimize false-positive variant

calls arising from PCR, among other considerations.

While depth of coverage is used as a quality measure of

WES it does not capture the variability of sequencing reads

along the exome, which may be detrimental for clinical test-

ing. Reporting that a certain proportion of the bases, e.g.

90%, reach the specified threshold is useless if relevant onco-

genes, such as CEBPA, only attain sporadic coverage at criti-

cal sites owing to a high GC percentage (Kong et al, 2018;

Roy et al, 2018), poor primer design etc. Currently, several

library preparation kits exist, such as SureSelect Clinical

Research Exome (Agilent, Santa Clara, CA, USA) or SeqCap

EZ MedExome (Roche, Basel, Switzerland), which combines

the genomic breadth of WES with a higher depth of coverage

in medical relevant regions. These extensions to classical tar-

geted WES may thus provide higher resolution in somatic

variant calling, copy gains, deletions, and other structural

variants when needed.

An argument against the adoption of WGS is the quantity

of data being generated and its bioinformatics challenges. In

addition, in many cases, the tumor DNA and the germline

DNA of the patient are both sequenced for a better bioinfor-

matics workup. However, this strategy may change due to

improved pipelines and databases for unpaired samples. One

suggestion to circumvent such bottlenecks of data handling

is the WGS postsequencing retrieval of coding regions for

initial analysis. As the performance of a vendor’s exome kit

can only be guaranteed in targeted regions, some exclusion

of off-target reads is already performed by the sequencing

centers or bioinformaticians. In some aspects, RNA tran-

scriptome sequencing also offers a direct alternative to WES,

as (i) it covers the coding regions, (ii) it is feasible for the

detection of somatic mutations, (iii) RNA sequencing pro-

vides additional information from the transcriptional profile

of the malignancy, and (iv) the workflow is simple and

requires a small amount of input material. However, the

additional information provided by DNA sequencing, such as

CNVs and allele frequency distributions, is partly or com-

pletely lost.

Short-read sequencing has been a huge market, but this

may change, as several opportunities arise from long-read

sequencing, such as identifying chromosomal translocations

and chromosomal phasing of variants. In addition, short

reads give rise to problems with sequencing of repetitive

stretches of DNA and structural variants, and generally fail

in detecting breakpoints of chromosomal translocations. In

the next decade, we will probably experience a revival of

long-read sequencing in the form of 3rd generation sequenc-

ing such as that provided by Oxford Nanopore Technologies

(Oxford, UK).

Concluding remarks

The replacement of WES by sequencing of the full genome

seems inevitable. This “return” to whole genome sequencing

represents a full circle, with sequencing of the first cancer

genome as the starting point of the exome sequencing revo-

lution, and its early signs of senescence a decade after the

first WES studies. The adept usage of WES and WGS to

chart oncogenic drivers led to the adoption of figurative jar-

gon, describing the landscapes of malignancies and capturing

researchers’ perception of new possibilities and its returns.

Undoubtedly, WGS will steadily be implemented in routine

diagnostics in the next decade.

Author contributions

MCH drafted the paper and figures based on a shared initia-

tive, and all authors contributed substantially to the final

manuscript.

Disclosures

MCH is part-time researcher and independent consultant in

NGS and bioinformatics. TH is part-owner of MLL Munich

Leukemia Laboratory. CGN has nothing to disclose.

Acknowledgements

The authors would like to thank Dina Mohyeldeen, MD,

and Vickie S. Kristensen for their critical view of the final

draft.

Review

376 ª 2019 British Society for Haematology and John Wiley & Sons LtdBritish Journal of Haematology, 2020, 188, 367–382

125

References

Alves, J.M. & Posada, D. (2018) Sensitivity to

sequencing depth in single-cell cancer genomics.

Genome Medicine, 10, 29.

Bainbridge, M.N., Wang, M., Burgess, D.L., Kovar,

C., Rodesch, M.J., D’Ascenzo, M., Kitzman, J.,

Wu, Y.-Q., Newsham, I., Richmond, T.A., Jed-

deloh, J.A., Muzny, D., Albert, T.J. & Gibbs,

R.A. (2010) Whole exome capture in solution

with 3 Gbp of data. Genome Biology, 11, R62.

Baron, C.S., Kester, L., Klaus, A., Boisset, J.C.,

Thambyrajah, R., Yvernogeau, L., Kouskoff, V.,

Lacaud, G., van Oudenaarden, A. & Robin, C.

(2018) Single-cell transcriptomics reveal the

dynamic of haematopoietic stem cell production

in the aorta. Nature Communications, 9, 2517.

Belkadi, A., Bolze, A., Itan, Y., Cobat, A., Vincent,

Q.B., Antipenko, A., Shang, L., Boisson, B.,

Casanova, J.-L. & Abel, L. (2015) Whole-gen-

ome sequencing is more powerful than whole-

exome sequencing for detecting exome variants.

Proceedings of the National Academy of Sciences

of the United States of America, 112, 5473–5478.

Berndt, S.I., Skibola, C.F., Joseph, V., Camp, N.J.,

Nieters, A., Wang, Z., Cozen, W., Monnereau,

A., Wang, S.S., Kelly, R.S., Lan, Q., Teras, L.R.,

Chatterjee, N., Chung, C.C., Yeager, M., Brooks-

Wilson, A.R., Hartge, P., Purdue, M.P., Bir-

mann, B.M., Armstrong, B.K., Cocco, P., Zhang,

Y., Severi, G., Zeleniuch-Jacquotte, A., Lawrence,

C., Burdette, L., Yuenger, J., Hutchinson, A.,

Jacobs, K.B., Call, T.G., Shanafelt, T.D., Novak,

A.J., Kay, N.E., Liebow, M., Wang, A.H.,

Smedby, K.E., Adami, H.-O., Melbye, M., Gli-

melius, B., Chang, E.T., Glenn, M., Curtin, K.,

Cannon-Albright, L.A., Jones, B., Diver, W.R.,

Link, B.K., Weiner, G.J., Conde, L., Bracci,

P.M., Riby, J., Holly, E.A., Smith, M.T., Jackson,

R.D., Tinker, L.F., Benavente, Y., Becker, N.,

Boffetta, P., Brennan, P., Foretova, L., Maynadie,

M., McKay, J., Staines, A., Rabe, K.G., Achen-

bach, S.J., Vachon, C.M., Goldin, L.R., Strom,

S.S., Lanasa, M.C., Spector, L.G., Leis, J.F., Cun-

ningham, J.M., Weinberg, J.B., Morrison, V.A.,

Caporaso, N.E., Norman, A.D., Linet, M.S., De

Roos, A.J., Morton, L.M., Severson, R.K., Riboli,

E., Vineis, P., Kaaks, R., Trichopoulos, D.,

Masala, G., Weiderpass, E., Chirlaque, M.-D.,

Vermeulen, R.C.H., Travis, R.C., Giles, G.G.,

Albanes, D., Virtamo, J., Weinstein, S., Clavel,

J., Zheng, T., Holford, T.R., Offit, K., Zelenetz,

A., Klein, R.J., Spinelli, J.J., Bertrand, K.A.,

Laden, F., Giovannucci, E., Kraft, P., Kricker, A.,

Turner, J., Vajdic, C.M., Ennas, M.G., Ferri,

G.M., Miligi, L., Liang, L., Sampson, J., Crouch,

S., Park, J.-H., North, K.E., Cox, A., Snowden,

J.A., Wright, J., Carracedo, A., Lopez-Otin, C.,

Bea, S., Salaverria, I., Martin-Garcia, D., Campo,

E., Fraumeni, J.F., de Sanjose, S., Hjalgrim, H.,

Cerhan, J.R., Chanock, S.J., Rothman, N. & Sla-

ger, S.L. (2013) Genome-wide association study

identifies multiple risk loci for chronic lympho-

cytic leukemia. Nature Genetics, 45, 868–876.

Best, K., Oakes, T., Heather, J.M., Shawe-Taylor, J.

& Chain, B. (2015) Computational analysis of

stochastic heterogeneity in PCR amplification

efficiency revealed by single molecule barcoding.

Scientific Reports, 5, 14629.

Bolli, N., Avet-Loiseau, H., Wedge, D.C., Van Loo,

P., Alexandrov, L.B., Martincorena, I., Dawson,

K.J., Iorio, F., Nik-Zainal, S., Bignell, G.R., Hin-

ton, J.W., Li, Y., Tubio, J.M.C., McLaren, S.,

O’Meara, S., Butler, A.P., Teague, J.W., Mudie,

L., Anderson, E., Rashid, N., Tai, Y.-T., Sham-

mas, M.A., Sperling, A.S., Fulciniti, M., Richard-

son, P.G., Parmigiani, G., Magrangeas, F.,

Minvielle, S., Moreau, P., Attal, M., Facon, T.,

Futreal, P.A., Anderson, K.C., Campbell, P.J. &

Munshi, N.C. (2014) Heterogeneity of genomic

evolution and mutational profiles in multiple

myeloma. Nature Communications, 5, 2997.

Braggio, E., Egan, J.B., Fonseca, R. & Stewart, A.K.

(2013) Lessons from next-generation sequencing

analysis in hematological malignancies. Blood

Cancer Journal, 3, e127.

Cancer Genome Atlas Research Network; Ley, T.J.,

Miller, C., Ding, L., Raphael, B.J., Mungall, A.J.,

Robertson, A., Hoadley, K., Triche, T.J. Jr,

Laird, P.W., Baty, J.D., Fulton, L.L., Fulton, R.,

Heath, S.E., Kalicki-Veizer, J., Kandoth, C.,

Klco, J.M., Koboldt, D.C., Kanchi, K.L., Kulka-

rni, S., Lamprecht, T.L., Larson, D.E., Lin, L.,

Lu, C., McLellan, M.D., McMichael, J.F., Payton,

J., Schmidt, H., Spencer, D.H., Tomasson, M.H.,

Wallis, J.W., Wartman, L.D., Watson, M.A.,

Welch, J., Wendl, M.C., Ally, A., Balasundaram,

M., Birol, I., Butterfield, Y., Chiu, R., Chu, A.,

Chuah, E., Chun, H.J., Corbett, R., Dhalla, N.,

Guin, R., He, A., Hirst, C., Hirst, M., Holt,

R.A., Jones, S., Karsan, A., Lee, D., Li, H.I.,

Marra, M.A., Mayo, M., Moore, R.A., Mungall,

K., Parker, J., Pleasance, E., Plettner, P., Schein,

J., Stoll, D., Swanson, L., Tam, A., Thiessen, N.,

Varhol, R., Wye, N., Zhao, Y., Gabriel, S., Getz,

G., Sougnez, C., Zou, L., Leiserson, M.D., Van-

din, F., Wu, H.T., Applebaum, F., Baylin, S.B.,

Akbani, R., Broom, B.M., Chen, K., Motter,

T.C., Nguyen, K., Weinstein, J.N., Zhang, N.,

Ferguson, M.L., Adams, C., Black, A., Bowen, J.,

Gastier-Foster, J., Grossman, T., Lichtenberg, T.,

Wise, L., Davidsen, T., Demchok, J.A., Shaw,

K.R., Sheth, M., Sofia, H.J., Yang, L., Downing,

J.R. & Eley, G. (2013) Genomic and epigenomic

landscapes of adult de novo acute myeloid leu-

kemia. New England Journal of Medicine, 368,

2059–2074.

Catlin, S.N., Busque, L., Gale, R.E., Guttorp, P. &

Abkowitz, J.L. (2011) The replication rate of

human hematopoietic stem cells in vivo. Blood,

117, 4460–4466.

Chapman, M.A., Lawrence, M.S., Keats, J.J., Cibul-

skis, K., Sougnez, C., Schinzel, A.C., Harview,

C.L., Brunet, J.-P., Ahmann, G.J., Adli, M.,

Anderson, K.C., Ardlie, K.G., Auclair, D., Baker,

A., Bergsagel, P.L., Bernstein, B.E., Drier, Y.,

Fonseca, R., Gabriel, S.B., Hofmeister, C.C.,

Jagannath, S., Jakubowiak, A.J., Krishnan, A.,

Levy, J., Liefeld, T., Lonial, S., Mahan, S.,

Mfuko, B., Monti, S., Perkins, L.M., Onofrio, R.,

Pugh, T.J., Rajkumar, S.V., Ramos, A.H., Siegel,

D.S., Sivachenko, A., Stewart, A.K., Trudel, S.,

Vij, R., Voet, D., Winckler, W., Zimmerman, T.,

Carpten, J., Trent, J., Hahn, W.C., Garraway,

L.A., Meyerson, M., Lander, E.S., Getz, G. &

Golub, T.R. (2011) Initial genome sequencing

and analysis of multiple myeloma. Nature, 471,

467–472.

Choi, M., Scholl, U.I., Ji, W., Liu, T., Tikhonova,

I.R., Zumbo, P., Nayir, A., Bakkalo!glu, A., €Ozen,

S., Sanjad, S., Nelson-Williams, C., Farhi, A.,

Mane, S. & Lifton, R.P. (2009) Genetic diagnosis

by whole exome capture and massively parallel

DNA sequencing. Proceedings of the National

Academy of Sciences of the United States of Amer-

ica, 106, 19096–19101.

Chong, J.X., Buckingham, K.J., Jhangiani, S.N.,

Boehm, C., Sobreira, N., Smith, J.D., Harrell,

T.M., McMillin, M.J., Wiszniewski, W., Gambin,

T., Coban Akdemir, Z.H., Doheny, K., Scott,

A.F., Avramopoulos, D., Chakravarti, A.,

Hoover-Fong, J., Mathews, D., Witmer, P.D.,

Ling, H., Hetrick, K., Watkins, L., Patterson,

K.E., Reinier, F., Blue, E., Muzny, D., Kircher,

M., Bilguvar, K., L#opez-Gir#aldez, F., Sutton,

V.R., Tabor, H.K., Leal, S.M., Gunel, M., Mane,

S., Gibbs, R.A., Boerwinkle, E., Hamosh, A.,

Shendure, J., Lupski, J.R., Lifton, R.P., Valle, D.,

Nickerson, D.A. & Bamshad, M.J. (2015) The

genetic basis of mendelian phenotypes: discover-

ies, challenges, and opportunities. American

Journal of Human Genetics, 97, 199–215.

Churchman, M.L., Qian, M., Te Kronnie, G.,

Zhang, R., Yang, W., Zhang, H., Lana, T.,

Tedrick, P., Baskin, R., Verbist, K., Peters, J.L.,

Devidas, M., Larsen, E., Moore, I.M., Gu, Z.,

Qu, C., Yoshihara, H., Porter, S.N., Pruett-

Miller, S.M., Wu, G., Raetz, E., Martin, P.L.,

Bowman, W.P., Winick, N., Mardis, E., Fulton,

R., Stanulla, M., Evans, W.E., Relling, M.V., Pui,

C.H., Hunger, S.P., Loh, M.L., Handgretinger,

R., Nichols, K.E., Yang, J.J. & Mullighan, C.G.

(2018) Germline genetic IKZF1 variation and

predisposition to childhood acute lymphoblastic

leukemia. Cancer Cell, 33, e938.

Crassini, K., Stevenson, W.S., Mulligan, S.P. &

Best, O.G. (2019) Molecular pathogenesis of

chronic lymphocytic leukaemia. British Journal

of Haematology, 186, 668–684.

Crowther-Swanepoel, D., Qureshi, M., Dyer, M.J.,

Matutes, E., Dearden, C., Catovsky, D. & Houl-

ston, R.S. (2009) Genetic variation in CXCR4

and risk of chronic lymphocytic leukemia.

Blood, 114, 4843–4846.

Crowther-Swanepoel, D., Broderick, P., Di Ber-

nardo, M.C., Dobbins, S.E., Torres, M., Man-

souri, M., Ruiz-Ponte, C., Enjuanes, A.,

Rosenquist, R., Carracedo, A., Jurlander, J.,

Campo, E., Juliusson, G., Montserrat, E.,

Smedby, K.E., Dyer, M.J.S., Matutes, E., Dear-

den, C., Sunter, N.J., Hall, A.G., Mainou-Fowler,

T., Jackson, G.H., Summerfield, G., Harris, R.J.,

Review

ª 2019 British Society for Haematology and John Wiley & Sons Ltd 377British Journal of Haematology, 2020, 188, 367–382

126

Pettitt, A.R., Allsup, D.J., Bailey, J.R., Pratt, G.,

Pepper, C., Fegan, C., Parker, A., Oscier, D.,

Allan, J.M., Catovsky, D. & Houlston, R.S.

(2010) Common variants at 2q37.3, 8q24.21,

15q21.3 and 16q24.1 influence chronic lympho-

cytic leukemia risk. Nature Genetics, 42, 132–

136.

Ding, L.-W., Sun, Q.-Y., Tan, K.-T., Chien, W.,

Thippeswamy, A.M., Eng Juh Yeoh, A., Kawa-

mata, N., Nagata, Y., Xiao, J.-F., Loh, X.-Y., Lin,

D.-C., Garg, M., Jiang, Y.-Y., Xu, L., Lim, S.-L.,

Liu, L.-Z., Madan, V., Sanada, M., Fern#andez,

L.T., Preethi, H., Lill, M., Kantarjian, H.M.,

Kornblau, S.M., Miyano, S., Liang, D.-C.,

Ogawa, S., Shih, L.-Y., Yang, H. & Koeffler,

H.P. (2017) Mutational landscape of pediatric

acute lymphoblastic leukemia. Cancer Research,

77, 390–400.

Fromer, M., Moran, J.L., Chambert, K., Banks, E.,

Bergen, S.E., Ruderfer, D.M., Handsaker, R.E.,

McCarroll, S.A., O’Donovan, M.C., Owen, M.J.,

Kirov, G., Sullivan, P.F., Hultman, C.M., Sklar,

P. & Purcell, S.M. (2012) Discovery and statisti-

cal genotyping of copy-number variation from

whole-exome sequencing depth. American Jour-

nal of Human Genetics, 91, 597–607.

Gawad, C., Koh, W. & Quake, S.R. (2014) Dissect-

ing the clonal origins of childhood acute lym-

phoblastic leukemia by single-cell genomics.

Proceedings of the National Academy of Sciences

of the United States of America, 111, 17947–

17952.

Gawad, C., Koh, W. & Quake, S.R. (2016) Single-

cell genome sequencing: current state of the

science. Nature Reviews Genetics, 17, 175–188.

Genomes Project Consortioum; Auton, A., Brooks,

L.D., Durbin, R.M., Garrison, E.P., Kang, H.M.,

Korbel, J.O., Marchini, J.L., McCarthy, S.,

McVean, G.A. & Abecasis, G.R. (2015) A global

reference for human genetic variation. Nature,

526, 68–74.

Genovese, G., Kahler, A.K., Handsaker, R.E., Lind-

berg, J., Rose, S.A., Bakhoum, S.F., Chambert,

K., Mick, E., Neale, B.M., Fromer, M., Purcell,

S.M., Svantesson, O., Land#en, M., H€oglund, M.,

Lehmann, S., Gabriel, S.B., Moran, J.L., Lander,

E.S., Sullivan, P.F., Sklar, P., Gr€onberg, H.,

Hultman, C.M. & McCarroll, S.A. (2014) Clonal

hematopoiesis and blood-cancer risk inferred

from blood DNA sequence. New England Jour-

nal of Medicine, 371, 2477–2487.

Gerlinger, M., Rowan, A.J., Horswell, S., Math, M.,

Larkin, J., Endesfelder, D., Gronroos, E., Marti-

nez, P., Matthews, N., Stewart, A., Tarpey, P.,

Varela, I., Phillimore, B., Begum, S., McDonald,

N.Q., Butler, A., Jones, D., Raine, K., Latimer,

C., Santos, C.R., Nohadani, M., Eklund, A.C.,

Spencer-Dene, B., Clark, G., Pickering, L.,

Stamp, G., Gore, M., Szallasi, Z., Downward, J.,

Futreal, P.A. & Swanton, C. (2012) Intratumor

heterogeneity and branched evolution revealed

by multiregion sequencing. New England Journal

of Medicine, 366, 883–892.

Gnirke, A., Melnikov, A., Maguire, J., Rogov, P.,

LeProust, E.M., Brockman, W., Fennell, T.,

Giannoukos, G., Fisher, S., Russ, C., Gabriel, S.,

Jaffe, D.B., Lander, E.S. & Nusbaum, C. (2009)

Solution hybrid selection with ultra-long

oligonucleotides for massively parallel targeted

sequencing. Nature Biotechnology, 27, 182–189.

Goldin, L.R., McMaster, M.L. & Caporaso, N.E.

(2013) Precursors to lymphoproliferative malig-

nancies. Cancer Epidemiology, Biomarkers &

Prevention, 22, 533–539.

Goryca, K., Kulecka, M., Paziewska, A., Dab-

rowska, M., Grzelak, M., Skrzypczak, M., Ginal-

ski, K., Mroz, A., Rutkowski, A., Paczkowska,

K., Mikula, M. & Ostrowski, J. (2018) Exome

scale map of genetic alterations promoting

metastasis in colorectal cancer. BMC Genetics,

19, 85.

Haferlach, T., Nagata, Y., Grossmann, V., Okuno,

Y., Bacher, U., Nagae, G., Schnittger, S., Sanada,

M., Kon, A., Alpermann, T., Yoshida, K., Roller,

A., Nadarajah, N., Shiraishi, Y., Shiozawa, Y.,

Chiba, K., Tanaka, H., Koeffler, H.P., Klein, H.-

U., Dugas, M., Aburatani, H., Kohlmann, A.,

Miyano, S., Haferlach, C., Kern, W. & Ogawa, S.

(2014) Landscape of genetic lesions in 944

patients with myelodysplastic syndromes. Leuke-

mia, 28, 241–247.

Hansen, M.C., Nyvold, C.G., Roug, A.S., Kjeldsen,

E., Villesen, P., Nederby, L. & Hokland, P.

(2015) Nature and nurture: a case of transcend-

ing haematological pre-malignancies in a pair of

monozygotic twins adding possible clues on the

pathogenesis of B-cell proliferations. British

Journal of Haematology, 169, 391–400.

Hansen, M.C., Nederby, L., Kjeldsen, E., Petersen,

M.A., Ommen, H.B. & Hokland, P. (2018) Case

report: exome sequencing identifies T-ALL with

myeloid features as a IKZF1-struck early precur-

sor T-cell malignancy. Leukemia Research

Reports, 9, 1–4.

Hansen, M.C., C#edile, O., Ludvigsen, M., Kjeldsen,

E., Møller, P.L., Abildgaard, N., Hokland, P. &

Nyvold, C.G. (2019) Integrating detection of

copy neutral chromosomal losses in a clinical

setting in leukemia and lymphoma by means of

allelic imbalance and read depth ratio compar-

ison. bioRxiv, https://doi.org/10.1101/590752

Houlston, R.S., Sellick, G., Yuille, M., Matutes, E.

& Catovsky, D. (2003) Causation of chronic

lymphocytic leukemia–insights from familial dis-

ease. Leukemia Research, 27, 871–876.

Ibanez, M., Carbonell-Caballero, J., Garcia-Alonso,

L., Such, E., Jimenez-Almazan, J., Vidal, E., Bar-

rag#an, E., L#opez-Pav#ıa, M., LLop, M., Mart#ın, I.

& G#omez-Segu#ı, I. (2016) The mutational land-

scape of acute promyelocytic leukemia reveals

an interacting network of co-occurrences and

recurrent mutations. PLoS ONE, 11, e0148346.

Jaiswal, S., Fontanillas, P., Flannick, J., Manning,

A., Grauman, P.V., Mar, B.G., Lindsley, R.C.,

Mermel, C.H., Burtt, N., Chavez, A., Higgins,

J.M., Moltchanov, V., Kuo, F.C., Kluk, M.J.,

Henderson, B., Kinnunen, L., Koistinen, H.A.,

Ladenvall, C., Getz, G., Correa, A., Banahan,

B.F., Gabriel, S., Kathiresan, S., Stringham,

H.M., McCarthy, M.I., Boehnke, M.,

Tuomilehto, J., Haiman, C., Groop, L., Atzmon,

G., Wilson, J.G., Neuberg, D., Altshuler, D. &

Ebert, B.L. (2014) Age-related clonal hematopoi-

esis associated with adverse outcomes. New Eng-

land Journal of Medicine, 371, 2488–2498.

Jan, M., Snyder, T.M., Corces-Zimmerman, M.R.,

Vyas, P., Weissman, I.L., Quake, S.R. & Majeti,

R. (2012) Clonal evolution of preleukemic

hematopoietic stem cells precedes human acute

myeloid leukemia. Science Translational Medi-

cine, 4, 149ra118.

Jongen-Lavrencic, M., Grob, T., Hanekamp, D.,

Kavelaars, F.G., al Hinai, A., Zeilemaker, A.,

Erpelinck-Verschueren, C.A.J., Gradowska, P.L.,

Meijer, R., Cloos, J., Biemond, B.J., Graux, C.,

van Marwijk Kooy, M., Manz, M.G., Pabst, T.,

Passweg, J.R., Havelange, V., Ossenkoppele,

G.J., Sanders, M.A., Schuurhuis, G.J., L€owen-

berg, B. & Valk, P.J.M. (2018) Molecular mini-

mal residual disease in acute myeloid leukemia.

New England Journal of Medicine, 378, 1189–

1199.

Jordan, C.T. (2007) The leukemic stem cell. Best

Practice & Research Clinical Haematology, 20,

13–18.

Karczewski, K.J., Francioli, L.C., Tiao, G., Cum-

mings, B.B., Alf€oldi, J., Wang, Q., Collins, R.L.,

Laricchia, K.M., Ganna, A., Birnbaum, D.P.,

Gauthier, L.D., Brand, H., Solomonson, M.,

Watts, N.A., Rhodes, D., Singer-Berk, M., Seaby,

E.G., Kosmicki, J.A., Walters, R.K., Tashman,

K., Farjoun, Y., Banks, E., Poterba, T., Wang,

A., Seed, C., Whiffin, N., Chong, J.X., Samocha,

K.E., Pierce-Hoffman, E., Zappala, Z., O’Don-

nell-Luria, A.H., Vallabh Minikel, E., Weisburd,

B., Lek, M., Ware, J.S., Vittal, C., Armean, I.M.,

Bergelson, L., Cibulskis, K., Connolly, K.M.,

Covarrubias, M., Donnelly,S., Ferriera, S., Gab-

riel, S., Gentry, J., Gupta, N., Jeandet, T.,

Kaplan, D., Llanwarne, C., Munshi, R., Novod,

S., Petrillo, N., Roazen, D., Ruano-Rubio, V.,

Saltzman, A., Schleicher, M., Soto, J., Tibbetts,

K., Tolonen, C., Wade, G., Talkowski, M.E., The

Genome Aggregation Database Consortium,

Neale, B.M., Daly, M.J. & MacArthur, D.G.

(2019) Variation across 141,456 human exomes

and genomes reveals the spectrum of loss-of-

function intolerance across human protein-cod-

ing genes. bioRxiv, 531210. https://doi.org/10.

1101/531210.

Kong, S.W., Lee, I.H., Liu, X., Hirschhorn, J.N.

& Mandl, K.D. (2018) Measuring coverage and

accuracy of whole-exome sequencing in clinical

context. Genetics in Medicine, 20, 1617–1626.

Lai, S., Huang, W., Xu, Y., Jiang, M., Chen, H.,

Cheng, C., Lu, Y., Huang, H., Guo, G. & Han,

X. (2018) Comparative transcriptomic analysis

of hematopoietic system between human and

mouse by Microwell-seq. Cell Discovery, 4, 34.

Lander, E.S. & Waterman, M.S. (1988) Genomic

mapping by fingerprinting random clones: a

mathematical analysis. Genomics, 2, 231–239.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum,

C., Zody, M.C., Baldwin, J., Devon, K., Dewar,

K., Doyle, M., FitzHugh, W. & Funke, R. (2001)

Review

378 ª 2019 British Society for Haematology and John Wiley & Sons LtdBritish Journal of Haematology, 2020, 188, 367–382

127

Initial sequencing and analysis of the human

genome. Nature, 409, 860–921.

Lawrence, M.S., Stojanov, P., Mermel, C.H.,

Robinson, J.T., Garraway, L.A., Golub, T.R.,

Meyerson, M., Gabriel, S.B., Lander, E.S. &

Getz, G. (2014) Discovery and saturation analy-

sis of cancer genes across 21 tumour types. Nat-

ure, 505, 495–501.

Lek, M., Karczewski, K.J., Minikel, E.V., Samocha,

K.E., Banks, E., Fennell, T., O’Donnell-Luria,

A.H., Ware, J.S., Hill, A.J., Cummings, B.B.,

Tukiainen, T., Birnbaum, D.P., Kosmicki, J.A.,

Duncan, L.E., Estrada, K., Zhao, F., Zou, J.,

Pierce-Hoffman, E., Berghout, J., Cooper, D.N.,

Deflaux, N., DePristo, M., Do, R., Flannick, J.,

Fromer, M., Gauthier, L., Goldstein, J., Gupta,

N., Howrigan, D., Kiezun, A., Kurki, M.I.,

Moonshine, A.L., Natarajan, P., Orozco, L.,

Peloso, G.M., Poplin, R., Rivas, M.A., Ruano-

Rubio, V., Rose, S.A., Ruderfer, D.M., Shakir,

K., Stenson, P.D., Stevens, C., Thomas, B.P.,

Tiao, G., Tusie-Luna, M.T., Weisburd, B., Won,

H.-H., Yu, D., Altshuler, D.M., Ardissino, D.,

Boehnke, M., Danesh, J., Donnelly, S., Elosua,

R., Florez, J.C., Gabriel, S.B., Getz, G., Glatt,

S.J., Hultman, C.M., Kathiresan, S., Laakso, M.,

McCarroll, S., McCarthy, M.I., McGovern, D.,

McPherson, R., Neale, B.M., Palotie, A., Purcell,

S.M., Saleheen, D., Scharf, J.M., Sklar, P., Sulli-

van, P.F., Tuomilehto, J., Tsuang, M.T., Wat-

kins, H.C., Wilson, J.G., Daly, M.J. &

MacArthur, D.G. (2016) Analysis of protein-

coding genetic variation in 60,706 humans. Nat-

ure, 536, 285–291.

Ley, T.J., Mardis, E.R., Ding, L., Fulton, B., McLel-

lan, M.D., Chen, K., Dooling, D., Dunford-

Shore, B.H., McGrath, S., Hickenbotham, M.,

Cook, L., Abbott, R., Larson, D.E., Koboldt,

D.C., Pohl, C., Smith, S., Hawkins, A., Abbott,

S., Locke, D., Hillier, L.D.W., Miner, T., Fulton,

L., Magrini, V., Wylie, T., Glasscock, J., Conyers,

J., Sander, N., Shi, X., Osborne, J.R., Minx, P.,

Gordon, D., Chinwalla, A., Zhao, Y., Ries, R.E.,

Payton, J.E., Westervelt, P., Tomasson, M.H.,

Watson, M., Baty, J., Ivanovich, J., Heath, S.,

Shannon, W.D., Nagarajan, R., Walter, M.J.,

Link, D.C., Graubert, T.A., DiPersio, J.F. & Wil-

son, R.K. (2008) DNA sequencing of a cytoge-

netically normal acute myeloid leukaemia

genome. Nature, 456, 66–72.

Lin, B., Wang, J. & Cheng, Y. (2008) Recent

patents and advances in the next-generation

sequencing technologies. Recent Patents on

Biomedical Engineeringe, 2008, 60–67.

Liu, Y., Easton, J., Shao, Y., Maciaszek, J., Wang,

Z., Wilkinson, M.R., McCastlain, K., Edmonson,

M., Pounds, S.B., Shi, L., Zhou, X., Ma, X., Sio-

son, E., Li, Y., Rusch, M., Gupta, P., Pei, D.,

Cheng, C., Smith, M.A., Auvil, J.G., Gerhard,

D.S., Relling, M.V., Winick, N.J., Carroll, A.J.,

Heerema, N.A., Raetz, E., Devidas, M., Willman,

C.L., Harvey, R.C., Carroll, W.L., Dunsmore,

K.P., Winter, S.S., Wood, B.L., Sorrentino, B.P.,

Downing, J.R., Loh, M.L., Hunger, S.P., Zhang,

J. & Mullighan, C.G. (2017) The genomic

landscape of pediatric and young adult T-lineage

acute lymphoblastic leukemia. Nature Genetics,

49, 1211–1218.

Lohr, J.G., Stojanov, P., Lawrence, M.S., Auclair,

D., Chapuy, B., Sougnez, C., Cruz-Gordillo, P.,

Knoechel, B., Asmann, Y.W., Slager, S.l., Novak,

A.J., Dogan, A., Ansell, S.M., Link, B.K., Zou,

L., Gould, J., Saksena, G., Stransky, N., Rangel-

Escareno, C., Fernandez-Lopez, J.C., Hidalgo-

Miranda, A., Melendez-Zajgla, J., Hernandez-

Lemus, E., Schwarz-Cruz y Celis, A., Imaz-

Rosshandler, I., Ojesina, A.I., Jung, J., Peda-

mallu, C.S., Lander, E.S., Habermann, T.M.,

Cerhan, J.R., Shipp, M.A., Getz, G. & Golub,

T.R. (2012) Discovery and prioritization of

somatic mutations in diffuse large B-cell lym-

phoma (DLBCL) by whole-exome sequencing.

Proceedings of the National Academy of Sciences

of the United States of America, 109, 3879–3884.

Lohr, J.G., Kim, S., Gould, J., Knoechel, B., Drier,

Y., Cotton, M.J., Gray, D., Birrer, N., Wong, B.,

Ha, G., Zhang, C.-Z., Guo, G., Meyerson, M.,

Yee, A.J., Boehm, J.S., Raje, N. & Golub, T.R.

(2016) Genetic interrogation of circulating mul-

tiple myeloma cells at single-cell resolution.

Science Translational Medicine, 8, 363ra147.

MacConaill, L.E., Burns, R.T., Nag, A., Coleman,

H.A., Slevin, M.K., Giorda, K., Light, M., Lai,

K., Jarosz, M., McNeill, M.S., Ducar, M.D.,

Meyerson, M. & Thorner, A.R. (2018) Unique,

dual-indexed sequencing adapters with UMIs

effectively eliminate index cross-talk and signifi-

cantly improve sensitivity of massively parallel

sequencing. BMC Genomics, 19, 30.

Malmberg, E.B., Stahlman, S., Rehammar, A.,

Samuelsson, T., Alm, S.J., Kristiansson, E., Abra-

hamsson, J., Garelius, H., Pettersson, L., Ehin-

ger, M., Palmqvist, L. & Fogelstrand, L. (2017)

Patient-tailored analysis of minimal residual dis-

ease in acute myeloid leukemia using next-gen-

eration sequencing. European Journal of

Haematology, 98, 26–37.

Mannelli, F., Ponziani, V., Bencini, S., Bonetti,

M.I., Benelli, M., Cutini, I., Gianfaldoni, G.,

Scappini, B., Pancani, F., Piccini, M. & Rondelli,

T. (2017) CEBPA-double-mutated acute myeloid

leukemia displays a unique phenotypic profile: a

reliable screening method and insight into bio-

logical features. Haematologica, 102, 529–540.

Mardis, E.R., Ding, L., Dooling, D.J., Larson, D.E.,

McLellan, M.D., Chen, K., Koboldt, D.C., Ful-

ton, R.S., Delehaunty, K.D., McGrath, S.D., Ful-

ton, L.A., Locke, D.P., Magrini, V.J., Abbott,

R.M., Vickery, T.L., Reed, J.S., Robinson, J.S.,

Wylie, T., Smith, S.M., Carmichael, L., Eldred,

J.M., Harris, C.C., Walker, J., Peck, J.B., Du, F.,

Dukes, A.F., Sanderson, G.E., Brummett, A.M.,

Clark, E., McMichael, J.F., Meyer, R.J., Schind-

ler, J.K., Pohl, C.S., Wallis, J.W., Shi, X., Lin, L.,

Schmidt, H., Tang, Y., Haipek, C., Wiechert,

M.E., Ivy, J.V., Kalicki, J., Elliott, G., Ries, R.E.,

Payton, J.E., Westervelt, P., Tomasson, M.H.,

Watson, M.A., Baty, J., Heath, S., Shannon,

W.D., Nagarajan, R., Link, D.C., Walter, M.J.,

Graubert, T.A., DiPersio, J.F., Wilson, R.K. &

Ley, T.J. (2009) Recurring mutations found by

sequencing an acute myeloid leukemia genome.

New England Journal of Medicine, 361, 1058–

1066.

Marke, R., van Leeuwen, F.N. & Scheijen, B.

(2018) The many faces of IKZF1 in B-cell pre-

cursor acute lymphoblastic leukemia. Haemato-

logica, 103, 565–574.

Mata, E., Diaz-Lopez, A., Martin-Moreno, A.M.,

Sanchez-Beato, M., Varela, I., Mestre, M.J., San-

tonja, C., Burgos, F., Men#arguez, J., Est#evez, M.,

Provencio, M., S#anchez-Espiridi#on, B., D#ıaz, E.,

Montalb#an, C., Piris, M.A. & Garc#ıa, J.F. (2017)

Analysis of the mutational landscape of classic

Hodgkin lymphoma identifies disease hetero-

geneity and potential therapeutic targets. Onco-

target, 8, 111386–111395.

Meggendorfer, M., Roller, A., Haferlach, T., Eder,

C., Dicker, F., Grossmann, V., Kohlmann, A.,

Alpermann, T., Yoshida, K., Ogawa, S., Koeffler,

H.P., Kern, W., Haferlach, C. & Schnittger, S.

(2012) SRSF2 mutations in 275 cases with

chronic myelomonocytic leukemia (CMML).

Blood, 120, 3080–3088.

Meynert, A.M., Bicknell, L.S., Hurles, M.E., Jack-

son, A.P. & Taylor, M.S. (2013) Quantifying sin-

gle nucleotide variant detection sensitivity in

exome sequencing. BMC Bioinformatics, 14, 195.

Meynert, A.M., Ansari, M., FitzPatrick, D.R. &

Taylor, M.S. (2014) Variant detection sensitivity

and biases in whole genome and exome

sequencing. BMC Bioinformatics, 15, 247.

Neumann, M., Coskun, E., Fransecky, L., Moch-

mann, L.H., Bartram, I., Farhadi Sartangi, N.,

Heesch, S., G€okbuget, N., Schwartz, S., Brandts,

C., Schlee, C., Haas, R., D€uhrsen, U.,

Griesshammer, M., D€ohner, H., Ehninger, G.,

Burmeister, T., Blau, O., Thiel, E., Hoelzer, D.,

Hofmann, W.-K. & Baldus, C.D. (2013a) FLT3

mutations in early T-cell precursor ALL charac-

terize a stem cell like leukemia and imply the

clinical use of tyrosine kinase inhibitors. PLoS

ONE, 8, e53190.

Neumann, M., Heesch, S., Schlee, C., Schwartz, S.,

Gokbuget, N., Hoelzer, D., Konstandin, N.P.,

Ksienzyk, B., Vosberg, S., Graf, A., Krebs, S.,

Blum, H., Raff, T., Bruggemann, M., Hofmann,

W.-K., Hecht, J., Bohlander, S.K., Greif, P.A. &

Baldus, C.D. (2013b) Whole-exome sequencing

in adult ETP-ALL reveals a high rate of

DNMT3A mutations. Blood, 121, 4749–4752.

Ng, S.B., Turner, E.H., Robertson, P.D., Flygare,

S.D., Bigham, A.W., Lee, C., Shaffer, T.,

Wong, M., Bhattacharjee, A., Eichler, E.E.,

Bamshad, M., Nickerson, D.A. & Shendure, J.

(2009) Targeted capture and massively parallel

sequencing of 12 human exomes. Nature, 461,

272–276.

Nikolaev, S.I., Santoni, F., Vannier, A., Falconnet,

E., Giarin, E., Basso, G., Hoischen, A., Veltman,

J.A., Groet, J., Nizetic, D. & Antonarakis, S.E.

(2013) Exome sequencing identifies putative dri-

vers of progression of transient myeloprolifera-

tive disorder to AMKL in infants with Down

syndrome. Blood, 122, 554–561.

Review

ª 2019 British Society for Haematology and John Wiley & Sons Ltd 379British Journal of Haematology, 2020, 188, 367–382

128

Pabst, T., Eyholzer, M., Haefliger, S., Schardt, J. &

Mueller, B.U. (2008) Somatic CEBPA mutations

are a frequent second event in families with

germline CEBPA mutations and familial acute

myeloid leukemia. Journal of Clinical Oncology,

26, 5088–5093.

Papaemmanuil, E., Cazzola, M., Boultwood, J.,

Malcovati, L., Vyas, P., Bowen, D., Pellagatti, A.,

Wainscoat, J.S., Hellstrom-Lindberg, E., Gamba-

corti-Passerini, C. & Godfrey, A.L. (2011)

Somatic SF3B1 mutation in myelodysplasia with

ring sideroblasts. New England Journal of Medi-

cine, 365, 1384–1395.

Papaemmanuil, E., Gerstung, M., Malcovati, L.,

Tauro, S., Gundem, G., Van Loo, P., Yoon, C.J.,

Ellis, P., Wedge, D.C., Pellagatti, A., Shlien, A.,

Groves, M.J., Forbes, S.A., Raine, K., Hinton, J.,

Mudie, L.J., McLaren, S., Hardy, C., Latimer, C.,

Della Porta, M.G., O’Meara, S., Ambaglio, I.,

Galli, A., Butler, A.p, Walldin, G., Teague, J.W.,

Quek, L., Sternberg, A., Gambacorti-Passerini,

C., Cross, N.C.P., Green, A.R., Boultwood, J.,

Vyas, P., Hellstrom-Lindberg, E., Bowen, D.,

Cazzola, M., Stratton, M.R. & Campbell, P.J.

(2013) Clinical and biological implications of

driver mutations in myelodysplastic syndromes.

Blood, 122, 3616–3627; quiz 3699.

Papaemmanuil, E., Gerstung, M., Bullinger, L.,

Gaidzik, V.I., Paschka, P., Roberts, N.D., Potter,

N.E., Heuser, M., Thol, F., Bolli, N., Gundem,

G., Van Loo, P., Martincorena, I., Ganly, P.,

Mudie, L., McLaren, S., O’Meara, S., Raine, K.,

Jones, D.R., Teague, J.W., Butler, A.P., Greaves,

M.F., Ganser, A., D€ohner, K., Schlenk, R.F.,

D€ohner, H. & Campbell, P.J. (2016) Genomic

classification and prognosis in acute myeloid

leukemia. New England Journal of Medicine, 374,

2209–2221.

Parla, J.S., Iossifov, I., Grabill, I., Spector, M.S.,

Kramer, M. & McCombie, W.R. (2011) A com-

parative analysis of exome capture. Genome Biol-

ogy, 12, R97.

Pellegrino, M., Sciambi, A., Treusch, S., Dur-

ruthy-Durruthy, R., Gokhale, K., Jacob, J.,

Chen, T.X., Geis, J.A., Oldham, W., Matthews,

J., Kantarjian, H., Futreal, P.A., Patel, K,

Jones, K.W., Takahashi, K. & Eastburn, D.J.

(2018) High-throughput single-cell DNA

sequencing of acute myeloid leukemia tumors

with droplet microfluidics. Genome Research,

28, 1345–1352.

Quesada, V., Conde, L., Villamor, N., Ordonez,

G.R., Jares, P., Bassaganyas, L., Ramsay, A.J.,

Be$a, S., Pinyol, M., Mart#ınez-Trillos, A., L#opez-

Guerra, M., Colomer, D., Navarro, A., Bau-

mann, T., Aymerich, M., Rozman, M., Delgado,

J., Gin#e, E., Hern#andez, J.M., Gonz#alez-D#ıaz,

M., Puente, D.A., Velasco, G., Freije, J.M.P.,

Tub#ıo, J.M.C., Royo, R., Gelp#ı, J.L., Orozco, M.,

Pisano, D.G., Zamora, J., V#azquez, M., Valencia,

A., Himmelbauer, H., Bay#es, M., Heath, S., Gut,

M., Gut, I., Estivill, X., L#opez-Guillermo, A.,

Puente, X.S., Campo, E. & L#opez-Ot#ın, C.

(2011) Exome sequencing identifies recurrent

mutations of the splicing factor SF3B1 gene in

chronic lymphocytic leukemia. Nature Genetics,

44, 47–52.

Ramsay, A.J., Quesada, V., Foronda, M., Conde,

L., Martinez-Trillos, A., Villamor, N., Rodr#ıguez,

D., Kwarciak, A., Garabaya, C., Gallardo, M.,

L#opez-Guerra, M., L#opez-Guillermo, A., Puente,

X.S., Blasco, M.A., Campo, E. & L#opez-Ot#ın, C.

(2013) POT1 mutations cause telomere dysfunc-

tion in chronic lymphocytic leukemia. Nature

Genetics, 45, 526–530.

Reddy, A., Zhang, J., Davis, N.S., Moffitt, A.B.,

Love, C.L., Waldrop, A., Leppa, S., Pasanen, A.,

Meriranta, L., Karjalainen-Lindsberg, M.L. &

Nørgaard, P. (2017) Genetic and functional dri-

vers of diffuse large B cell lymphoma. Cell, 171,

e415.

Reichel, J., Chadburn, A., Rubinstein, P.G., Giu-

lino-Roth, L., Tam, W., Liu, Y., Gaiolla, R., Eng,

K., Brody, J., Inghirami, G., Carlo-Stella, C.,

Santoro, A., Rahal, D., Totonchy, J., Elemento,

O., Cesarman, E. & Roshal, M. (2015) Flow

sorting and exome sequencing reveal the onco-

genome of primary Hodgkin and Reed-Stern-

berg cells. Blood, 125, 1061–1072.

Rotunno, M., McMaster, M.L., Boland, J., Bass, S.,

Zhang, X., Burdett, L., Hicks, B., Ravichandran,

S., Luke, B.T., Yeager, M., Fontaine, L., Hyland,

P.l., Goldstein, A.M., Chanock, S.J., Caporaso,

N.E., Tucker, M.A. & Goldin, L.R. (2016) Whole

exome sequencing in families at high risk for

Hodgkin lymphoma: identification of a predis-

posing mutation in the KDR gene. Haematolog-

ica, 101, 853–860.

Roug, A.S., Hansen, M.C., Nederby, L. & Hokland,

P. (2014) Diagnosing and following adult

patients with acute myeloid leukaemia in the

genomic age. British Journal of Haematology,

167, 162–176.

Roy, S., Coldren, C., Karunamurthy, A., Kip, N.S.,

Klee, E.W., Lincoln, S.E., Leon, A., Pullamb-

hatla, M., Temple-Smolkin, R.L., Voelkerding,

K.V., Wang, C. & Carter, A.B. (2018) Standards

and guidelines for validating next-generation

sequencing bioinformatics pipelines: a joint rec-

ommendation of the association for molecular

pathology and the college of american patholo-

gists. The Journal of Molecular Diagnostics, 20,

4–27.

Rumi, E. & Cazzola, M. (2017) Advances in under-

standing the pathogenesis of familial myelopro-

liferative neoplasms. British Journal of

Haematology, 178, 689–698.

Sanger, F., Nicklen, S. & Coulson, A.R. (1977)

DNA sequencing with chain-terminating inhibi-

tors. Proceedings of the National Academy of

Sciences of the United States of America, 74,

5463–5467.

Sano, S., Wang, Y. & Walsh, K. (2018) Clonal

hematopoiesis and its impact on cardiovascular

disease. Circulation Journal, 83, 2–11.

Sanoja-Flores, L., Flores-Montero, J., Garces, J.J.,

Paiva, B., Puig, N., Garc#ıa-Mateo, A., Garc#ıa-

S#anchez, O., Corral-Mateos, A., Burgos, L.,

Blanco, E., Hern#andez-Mart#ın, J., Pontes, R.,

D#ıez-Campelo, M., Millacoy, P., Rodr#ıguez-

Otero, P., Prosper, F., Merino, J., Vidriales,

M.B., Garc#ıa-Sanz, R., Romero, A., Palomera, L.,

R#ıos-Tamayo, R., P#erez-Andr#es, M., Blanco, J.F.,

Gonz#alez, M., van Dongen, J.J.M., Durie, B.,

Mateos, M.V., San-Miguel, J. & Orfao, A.

(2018) Next generation flow for minimally-inva-

sive blood characterization of MGUS and multi-

ple myeloma at diagnosis based on circulating

tumor plasma cells (CTPC). Blood Cancer Jour-

nal, 8, 117.

Sathirapongsasuti, J.F., Lee, H., Horst, B.A., Brun-

ner, G., Cochran, A.J., Binder, S., Quackenbush,

J. & Nelson, S.F. (2011) Exome sequencing-

based copy-number variation and loss of

heterozygosity detection: ExomeCNV. Bioinfor-

matics, 27, 2648–2654.

Scales, M., Chubb, D., Dobbins, S.E., Johnson,

D.C., Li, N., Sternberg, M.J., Weinhold, N.,

Stein, C., Jackson, G., Davies, F.E., Walker, B.A.,

Wardell, C.P., Houlston, R.S. & Morgan, G.J.

(2017) Search for rare protein altering variants

influencing susceptibility to multiple myeloma.

Oncotarget, 8, 36203–36210.

Schmitz, R., Wright, G.W., Huang, D.W., Johnson,

C.A., Phelan, J.D., Wang, J.Q., Roulland, S.,

Kasbekar, M., Young, R.M., Shaffer, A.L., Hod-

son, D.J., Xiao, W., Yu, X., Yang, Y., Zhao, H.,

Xu, W., Liu, X., Zhou, B., Du, W., Chan, W.C.,

Jaffe, E.S., Gascoyne, R.D., Connors, J.M.,

Campo, E., Lopez-Guillermo, A., Rosenwald, A.,

Ott, G., Delabie, J., Rimsza, L.M., Tay Kuang

Wei, K., Zelenetz, A.D., Leonard, J.P., Bartlett,

N.L., Tran, B., Shetty, J., Zhao, Y., Soppet, D.R.,

Pittaluga, S., Wilson, W.H. & Staudt, L.M.

(2018) Genetics and pathogenesis of diffuse

large B-cell lymphoma. New England Journal of

Medicine, 378, 1396–1407.

Schwartzman, O., Savino, A.M., Gombert, M.,

Palmi, C., Cario, G., Schrappe, M., Eckert, C.,

von Stackelberg, A., Huang, J.-Y., Hameiri-

Grossman, M., Avigad, S., te Kronnie, G.,

Geron, I., Birger, Y., Rein, A., Zarfati, G., Fis-

cher, U., Mukamel, Z., Stanulla, M., Biondi, A.,

Cazzaniga, G., Vetere, A., Wagner, B.K., Chen,

Z., Chen, S.-J., Tanay, A., Borkhardt, A. &

Izraeli, S. (2017) Suppressors and activators of

JAK-STAT signaling at diagnosis and relapse of

acute lymphoblastic leukemia in Down syn-

drome. Proceedings of the National Academy of

Sciences of the United States of America, 114,

E4030–E4039.

Sena, J.A., Galotto, G., Devitt, N.P., Connick,

M.C., Jacobi, J.L., Umale, P.E., Vidali, L. & Bell,

C.J. (2018) Unique Molecular Identifiers reveal

a novel sequencing artefact with implications for

RNA-Seq based gene expression analysis. Scien-

tific Reports, 8, 13121.

Simonsen, A.T., Hansen, M.C., Kjeldsen, E., Mol-

ler, P.L., Hindkjaer, J.J., Hokland, P. & Agger-

holm, A. (2018) Systematic evaluation of signal-

to-noise ratio in variant detection from single

cell genome multiple displacement amplification

and exome sequencing. BMC Genomics, 19, 681.

Slager, S.L. & Kay, N.E. (2009) Familial chronic

lymphocytic leukemia: what does it mean to

Review

380 ª 2019 British Society for Haematology and John Wiley & Sons LtdBritish Journal of Haematology, 2020, 188, 367–382

129

me? Clinical Lymphoma and Myeloma, 9, S194–

197.

Slager, S.L., Skibola, C.F., Di Bernardo, M.C.,

Conde, L., Broderick, P., McDonnell, S.K.,

Goldin, L.R., Croft, N., Holroyd, A., Harris, S.,

Riby, J., Serie, D.J., Kay, N.E., Call, T.G., Bracci,

P.M., Halperin, E., Lanasa, M.C., Cunningham,

J.M., Leis, J.F., Morrison, V.A., Spector, L.G.,

Vachon, C.M., Shanafelt, T.D., Strom, S.S.,

Camp, N.J., Weinberg, J.B., Matutes, E., Capo-

raso, N.E., Wade, R., Dyer, M.J.S., Dearden, C.,

Cerhan, J.R., Catovsky, D. & Houlston, R.S.

(2012) Common variation at 6p21.31 (BAK1)

influences the risk of chronic lymphocytic leuke-

mia. Blood, 120, 843–846.

Smith, M.L., Cavenagh, J.D., Lister, T.A. & Fitzgib-

bon, J. (2004) Mutation of CEBPA in familial

acute myeloid leukemia. New England Journal of

Medicine, 351, 2403–2407.

Speedy, H.E., Di Bernardo, M.C., Sava, G.P., Dyer,

M.J., Holroyd, A., Wang, Y., Sunter, N.J., Man-

souri, L., Juliusson, G., Smedby, K.E., Roos, G.,

Jayne, S., Majid, A., Dearden, C., Hall, A.G.,

Mainou-Fowler, T., Jackson, G.H., Summerfield,

G., Harris, R.J., Pettitt, A.R., Allsup, D.J., Bailey,

J.R., Pratt, G., Pepper, C., Fegan, C., Rosenquist,

R., Catovsky, D., Allan, J.M. & Houlston, R.S.

(2014) A genome-wide association study identi-

fies multiple susceptibility loci for chronic lym-

phocytic leukemia. Nature Genetics, 46, 56–60.

Steensma, D.P. (2018) Clinical consequences of

clonal hematopoiesis of indeterminate potential.

Blood Advances, 2, 3404–3410.

Steensma, D.P., Bejar, R., Jaiswal, S., Lindsley,

R.C., Sekeres, M.A., Hasserjian, R.P. & Ebert,

B.L. (2015) Clonal hematopoiesis of indetermi-

nate potential and its distinction from

myelodysplastic syndromes. Blood, 126, 9–16.

Tawana, K., Wang, J., Renneville, A., Bodor, C.,

Hills, R., Loveday, C., Savic, A., Van Delft,

F.W., Treleaven, J., Georgiades, P., Uglow, E.,

Asou, N., Uike, N., Debeljak, M., Jazbec, J.,

Ancliff, P., Gale, R., Thomas, X., Mialou, V.,

Dohner, K., Bullinger, L., Mueller, B., Pabst, T.,

Stelljes, M., Schlegelberger, B., Wozniak, E.,

Iqbal, S., Okosun, J., Araf, S., Frank, A.-K., Lau-

ridsen, F.B., Porse, B., Nerlov, C., Owen, C.,

Dokal, I., Gribben, J., Smith, M., Preudhomme,

C., Chelala, C., Cavenagh, J. & Fitzgibbon, J.

(2015) Disease evolution and outcomes in famil-

ial AML with germline CEBPA mutations.

Blood, 126, 1214–1223.

Tiacci, E., Trifonov, V., Schiavoni, G., Holmes, A.,

Kern, W., Martelli, M.P., Pucciarini, A., Bigerna,

B., Pacini, R., Wells, V.A. & Sportoletti, P.

(2011) BRAF mutations in hairy-cell leukemia.

New England Journal of Medicine, 364, 2305–

2315.

Tokheim, C.J., Papadopoulos, N., Kinzler, K.W.,

Vogelstein, B. & Karchin, R. (2016) Evaluating

the evaluation of cancer driver genes. Proceed-

ings of the National Academy of Sciences of the

United States of America, 113, 14330–14335.

Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W.,

Mural, R.J., Sutton, G.G., Smith, H.O., Yandell,

M., Evans, C.A., Holt, R.A., Gocayne, J.D., Ama-

natides, P., Ballew, R.M., Huson, D.H., Wort-

man, J.R., Zhang, Q., Kodira, C.D., Zheng,

X.H., Chen, L., Skupski, M., Subramanian, G.,

Thomas, P.D., Zhang, J., Gabor Miklos, G.L.,

Nelson, C., Broder, S., Clark, A.G., Nadeau, J.,

McKusick, V.A., Zinder, N., Levine, A.J.,

Roberts, R.J., Simon, M., Slayman, C., Hunka-

piller, M., Bolanos, R., Delcher, A., Dew, I., Fas-

ulo, D., Flanigan, M., Florea, L., Halpern, A.,

Hannenhalli, S., Kravitz, S., Levy, S., Mobarry,

C., Reinert, K., Remington, K., Abu-Threideh,

J., Beasley, E., Biddick, K., Bonazzi, V., Brandon,

R., Cargill, M., Chandramouliswaran, I., Char-

lab, R., Chaturvedi, K., Deng, Z., Francesco,

V.D., Dunn, P., Eilbeck, K., Evangelista, C.,

Gabrielian, A.E., Gan, W., Ge, W., Gong, F., Gu,

Z., Guan, P., Heiman, T.J., Higgins, M.E., Ji, R.-

R., Ke, Z., Ketchum, K.A., Lai, Z., Lei, Y., Li, Z.,

Li, J., Liang, Y., Lin, X., Lu, F., Merkulov, G.V.,

Milshina, N., Moore, H.M., Naik, A.K., Nara-

yan, V.A., Neelam, B., Nusskern, D., Rusch,

D.B., Salzberg, S., Shao, W., Shue, B., Sun, J.,

Wang, Z.Y., Wang, A., Wang, X., Wang, J., Wei,

M.-H., Wides, R., Xiao, C., Yan, C., Yao, A., Ye,

J., Zhan, M., Zhang, W., Zhang, H., Zhao, Q.,

Zheng, L., Zhong, F., Zhong, W., Zhu, S.C.,

Zhao, S., Gilbert, D., Baumhueter, S., Spier, G.,

Carter, C., Cravchik, A., Woodage, T., Ali, F.,

An, H., Awe, A., Baldwin, D., Baden, H., Barn-

stead, M., Barrow, I., Beeson, K., Busam, D.,

Carver, A., Center, A., Cheng, M.L., Curry, L.,

Danaher, S., Davenport, L., Desilets, R., Dietz,

S., Dodson, K., Doup, L., Ferriera, S., Garg, N.,

Gluecksmann, A., Hart, B., Haynes, J., Haynes,

C., Heiner, C., Hladun, S., Hostin, D., Houck,

J., Howland, T., Ibegwam, C., Johnson, J.,

Kalush, F., Kline, L., Koduru, S., Love, A.,

Mann, F., May, D., McCawley, S., McIntosh, T.,

McMullen, I., Moy, M., Moy, L., Murphy, B.,

Nelson, K., Pfannkoch, C., Pratts, E., Puri, V.,

Qureshi, H., Reardon, M., Rodriguez, R.,

Rogers, Y.-H., Romblad, D., Ruhfel, B., Scott,

R., Sitter, C., Smallwood, M., Stewart, E.,

Strong, R., Suh, E., Thomas, R., Tint, N.N., Tse,

S., Vech, C., Wang, G., Wetter, J., Williams, S.,

Williams, M., Windsor, S., Winn-Deen, E.,

Wolfe, K., Zaveri, J., Zaveri, K., Abril, J.F.,

Guig#o, R., Campbell, M.J., Sjolander, K.V., Kar-

lak, B., Kejariwal, A., Mi, H., Lazareva, B., Hat-

ton, T., Narechania, A., Diemer, K.,

Muruganujan, A., Guo, N., Sato, S., Bafna, V.,

Istrail, S., Lippert, R., Schwartz, R., Walenz, B.,

Yooseph, S., Allen, D., Basu, A., Baxendale, J.,

Blick, L., Caminha, M., Carnes-Stine, J., Caulk,

P., Chiang, Y.-H., Coyne, M., Dahlke, C., Mays,

A.D., Dombroski, M., Donnelly, M., Ely, D.,

Esparham, S., Fosler, C., Gire, H., Glanowski, S.,

Glasser, K., Glodek, A., Gorokhov, M., Graham,

K., Gropman, B., Harris, M., Heil, J., Hender-

son, S., Hoover, J., Jennings, D., Jordan, C., Jor-

dan, J., Kasha, J., Kagan, L., Kraft, C., Levitsky,

A., Lewis, M., Liu, X., Lopez, J., Ma, D.,

Majoros, W., McDaniel, J., Murphy, S., New-

man, M., Nguyen, T., Nguyen, N., Nodell, M.,

Pan, S., Peck, J., Peterson, M., Rowe, W., San-

ders, R., Scott, J., Simpson, M., Smith, T., Spra-

gue, A., Stockwell, T., Turner, R., Venter, E.,

Wang, M., Wen, M., Wu, D., Wu, M., Xia, A.,

Zandieh, A. & Zhu, X. (2001) The sequence of

the human genome. Science, 291, 1304–1351.

Vesely, C., Frech, C., Eckert, C., Cario, G., Meck-

lenbrauker, A., zur Stadt, U., Nebral, K., Kraler,

F., Fischer, S., Attarbaschi, A., Schuster, M.,

Bock, C., Cav#e, H., von Stackelberg, A.,

Schrappe, M., Horstmann, M.A., Mann, G.,

Haas, O.A. & Panzer-Gr€umayer, R. (2017)

Genomic and transcriptional landscape of

P2RY8-CRLF2-positive childhood acute lym-

phoblastic leukemia. Leukemia, 31, 1491–1501.

Vogelstein, B., Papadopoulos, N., Velculescu, V.E.,

Zhou, S., Diaz, L.A. Jr& Kinzler, K.W. (2013)

Cancer genome landscapes. Science, 339, 1546–

1558.

Walker, B.A., Wardell, C.P., Melchor, L., Hulkki,

S., Potter, N.E., Johnson, D.C., Fenwick, K.,

Kozarewa, I., Gonzalez, D., Lord, C.J., Ashworth,

A., Davies, F.E. & Morgan, G.J. (2012) Intra-

clonal heterogeneity and distinct molecular

mechanisms characterize the development of t

(4;14) and t(11;14) myeloma. Blood, 120, 1077–

1086.

Walter, C., Pozzorini, C., Reinhardt, K., Geffers,

R., Xu, Z., Reinhardt, D., von Neuhoff, N. &

Hanenberg, H. (2018) Single-cell whole exome

and targeted sequencing in NPM1/FLT3 positive

pediatric acute myeloid leukemia. Pediatric

Blood & Cancer, 65, e26848.

Wang, Q., Shashikant, C.S., Jensen, M., Altman,

N.S. & Girirajan, S. (2017) Novel metrics to

measure coverage in whole exome sequencing

datasets reveal local and global non-uniformity.

Scientific Reports, 7, 885.

Warr, A., Robert, C., Hume, D., Archibald, A.,

Deeb, N. & Watson, M. (2015) Exome sequenc-

ing: current and future perspectives. G3: Genes,

Genomes, Genetics, 5, 1543–1550.

Welch, J.S., Ley, T.J., Link, D.C., Miller, C.A., Lar-

son, D.E., Koboldt, D.C., Wartman, L.D., Lam-

precht, T.L., Liu, F., Xia, J., Kandoth, C.,

Fulton, R.S., McLellan, M.D., Dooling, D.J.,

Wallis, J.W., Chen, K., Harris, C.C., Schmidt,

H.K., Kalicki-Veizer, J.M., Lu, C., Zhang, Q.,

Lin, L., O’Laughlin, M.D., McMichael, J.F.,

Delehaunty, K.D., Fulton, L.A., Magrini, V.J.,

McGrath, S.D., Demeter, R.T., Vickery, T.L.,

Hundal, J., Cook, L.L., Swift, G.W., Reed, J.P.,

Alldredge, P.A., Wylie, T.N., Walker, J.R., Wat-

son, M.A., Heath, S.E., Shannon, W.D., Vargh-

ese, N., Nagarajan, R., Payton, J.E., Baty, J.D.,

Kulkarni, S., Klco, J.M., Tomasson, M.H.,

Westervelt, P., Walter, M.J., Graubert, T.A.,

DiPersio, J.F., Ding, L., Mardis, E.R. & Wilson,

R.K. (2012) The origin and evolution of muta-

tions in acute myeloid leukemia. Cell, 150, 264–

278.

Wheeler, D.A., Srinivasan, M., Egholm, M., Shen,

Y., Chen, L., McGuire, A., He, W., Chen, Y.-J.,

Makhijani, V., Roth, G.T., Gomes, X., Tartaro,

K., Niazi, F., Turcotte, C.L., Irzyk, G.P., Lupski,

Review

ª 2019 British Society for Haematology and John Wiley & Sons Ltd 381British Journal of Haematology, 2020, 188, 367–382

130

J.R., Chinault, C., Song, X.-Z., Liu, Y., Yuan, Y.,

Nazareth, L., Qin, X., Muzny, D.M., Margulies,

M., Weinstock, G.M., Gibbs, R.A. & Rothberg,

J.M. (2008) The complete genome of an individ-

ual by massively parallel DNA sequencing. Nat-

ure, 452, 872–876.

Wood, L.D., Parsons, D.W., Jones, S., Lin, J., Sjo-

blom, T., Leary, R.J., Shen, D., Boca, S.M., Bar-

ber, T., Ptak, J., Silliman, N., Szabo, S., Dezso,

Z., Ustyanksky, V., Nikolskaya, T., Nikolsky, Y.,

Karchin, R., Wilson, P.A., Kaminker, J.S.,

Zhang, Z., Croshaw, R., Willis, J., Dawson, D.,

Shipitsin, M., Willson, J.K.V., Sukumar, S.,

Polyak, K., Park, B.H., Pethiyagoda, C.L., Pant,

P.V.K., Ballinger, D.G., Sparks, A.B., Hartigan,

J., Smith, D.R., Suh, E., Papadopoulos, N.,

Buckhaults, P., Markowitz, S.D., Parmigiani, G.,

Kinzler, K.W., Velculescu, V.E. & Vogelstein, B.

(2007) The genomic landscapes of human breast

and colorectal cancers. Science, 318, 1108–1113.

Xie, M., Lu, C., Wang, J., McLellan, M.D., John-

son, K.J., Wendl, M.C., McMichael, J.F., Sch-

midt, H.K., Yellapantula, V., Miller, C.A.,

Ozenberger, B.A., Welch, J.S., Link, D.C., Wal-

ter, M.J., Mardis, E.R., Dipersio, J.F., Chen, F.,

Wilson, R.K., Ley, T.J. & Ding, L. (2014) Age-

related mutations associated with clonal

hematopoietic expansion and malignancies. Nat-

ure Medicine, 20, 1472–1478.

Yan, B., Hu, Y., Ng, C., Ban, K.H., Tan, T.W.,

Huan, P.T., Lee, P.-L., Chiu, L., Seah, E., Ng,

C.H., Koay, E.S.-C. & Chng, W.-J. (2016)

Coverage analysis in a targeted amplicon-based

next-generation sequencing panel for myeloid

neoplasms. Journal of Clinical Pathology, 69,

801–804.

Yashiro, Y., Bannai, H., Minowa, T., Yabiku, T.,

Miyano, S., Osawa, M., Iwama, A. & Nakauchi,

H. (2009) Transcriptional profiling of

hematopoietic stem cells by high-throughput

sequencing. International Journal of Hematology,

89, 24–33.

Yoshida, K., Sanada, M., Shiraishi, Y., Nowak, D.,

Nagata, Y., Yamamoto, R., Sato, Y., Sato-

Otsubo, A., Kon, A., Nagasaki, M., Chalkidis,

G., Suzuki, Y., Shiosaka, M., Kawahata, R., Yam-

aguchi, T., Otsu, M., Obara, N., Sakata-Yanagi-

moto, M., Ishiyama, K., Mori, H., Nolte, F.,

Hofmann, W.-K., Miyawaki, S., Sugano, S.,

Haferlach, C., Koeffler, H.P., Shih, L.-Y., Hafer-

lach, T., Chiba, S., Nakauchi, H., Miyano, S. &

Ogawa, S. (2011) Frequent pathway mutations

of splicing machinery in myelodysplasia. Nature,

478, 64–69.

Yoshida, K., Toki, T., Okuno, Y., Kanezaki, R.,

Shiraishi, Y., Sato-Otsubo, A., Sanada, M., Park,

M.-J., Terui, K., Suzuki, H., Kon, A., Nagata, Y.,

Sato, Y., Wang, R.N., Shiba, N., Chiba, K.,

Tanaka, H., Hama, A., Muramatsu, H., Hase-

gawa, D., Nakamura, K., Kanegane, H., Tsuka-

moto, K., Adachi, S., Kawakami, K., Kato, K.,

Nishimura, R., Izraeli, S., Hayashi, Y., Miyano,

S., Kojima, S., Ito, E. & Ogawa, S. (2013) The

landscape of somatic mutations in Down syn-

drome-related myeloid disorders. Nature Genet-

ics, 45, 1293–1299.

Zhang, J., Ding, L., Holmfeldt, L., Wu, G., Heatley,

S.L., Payne-Turner, D., Easton, J., Chen, X.,

Wang, J., Rusch, M., Lu, C., Chen, S.-C., Wei,

L., Collins-Underwood, J.R., Ma, J., Roberts,

K.G., Pounds, S.B., Ulyanov, A., Becksfort, J.,

Gupta, P., Huether, R., Kriwacki, R.W., Parker,

M., McGoldrick, D.J., Zhao, D., Alford, D.,

Espy, S., Bobba, K.C., Song, G., Pei, D., Cheng,

C., Roberts, S., Barbato, M.I., Campana, D.,

Coustan-Smith, E., Shurtleff, S.A., Raimondi,

S.C., Kleppe, M., Cools, J., Shimano, K.A., Her-

miston, M.L., Doulatov, S., Eppert, K., Laurenti,

E., Notta, F., Dick, J.E., Basso, G., Hunger, S.P.,

Loh, M.L., Devidas, M., Wood, B., Winter, S.,

Dunsmore, K.P., Fulton, R.S., Fulton, L.L.,

Hong, X., Harris, C.C., Dooling, D.J., Ochoa,

K., Johnson, K.J., Obenauer, J.C., Evans, W.E.,

Pui, C.-H., Naeve, C.W., Ley, T.J., Mardis, E.R.,

Wilson, R.K., Downing, J.R. & Mullighan, C.G.

(2012) The genetic basis of early T-cell precur-

sor acute lymphoblastic leukaemia. Nature, 481,

157–163.

Zhang, J., Grubor, V., Love, C.L., Banerjee, A.,

Richards, K.L., Mieczkowski, P.A., Dunphy, C.,

Choi, W., Au, W.Y., Srivastava, G., Lugar, P.L.,

Rizzieri, D.A., Lagoo, A.S., Bernal-Mizrachi, L.,

Mann, K.P., Flowers, C., Naresh, K., Evens, A.,

Gordon, L.I., Czader, M, Gill, J.I., Hsi, E.D.,

Liu, Q., Fan, A., Walsh, K., Jima, D., Smith,

L.L., Johnson, A.J., Byrd, J.C., Luftig, M.A., Ni,

T., Zhu, J., Chadburn, A., Levy, S., Dunson, D.

& Dave, S.S. (2013) Genetic heterogeneity of

diffuse large B-cell lymphoma. Proceedings of the

National Academy of Sciences of the United States

of America, 110, 1398–1403.

Review

382 ª 2019 British Society for Haematology and John Wiley & Sons LtdBritish Journal of Haematology, 2020, 188, 367–382

131

Paper II

132

METHODS/TECHNIQUES

Molecular characterization of sorted malignant B cells from patientsclinically identified with mantle cell lymphoma

Marcus Høy Hansena, Oriane C!edilea, Mia Koldby Bluma, Simone Valentin Hansena,Lene Hyldahl Ebbesenb, Hans Herluf Nørgaard Bentzenb, Mads Thomassenc, Torben A. Krusec,

Stephanie Kavanc, Eigil Kjeldsenb, Thomas Kielsgaard Kristensena, Jacob Haabera,Niels Abildgaarda, and Charlotte Guldborg Nyvolda

aHaematology−Pathology Research Laboratory, Research Unit for Haematology and Research Unit for Pathology, University of Southern

Denmark and Odense University Hospital, Odense, Denmark; bDepartment of Hematology, Aarhus University Hospital, Denmark; cDepartment

of Clinical Genetics, Odense University Hospital, Denmark

(Received 6 February 2019; revised 6 February 2020; accepted 5 March 2020)

Mantle cell lymphoma (MCL) is a tumor with a poor prognosis. A few studies have examinedthe molecular landscape by next-generation sequencing and provided valuable insights intorecurrent lesions driving this heterogeneous cancer. However, none has attempted to cross-linkthe individual genomic and transcriptomic profiles in sorted MCL cells to perform individualmolecular characterizations of the lymphomas. Such approaches are relevant as MCL is heter-ogenous by nature, and thorough molecular diagnostics may potentially benefit the patientwith more focused treatment options. In the work described here, we used sorted lymphomacells from four patients at diagnosis and relapse by intersecting the coding DNA and mRNA.Even though only a few patients were included, this method enabled us to pinpoint a specificset of expressed somatic mutations, to present an overall expression profile different from thenormal B cell counterparts, and to track molecular aberrations from diagnosis to relapse.Changes in single-nucleotide coding variants, subtle clonal changes in large-copy-number alter-ations, subclonal involvement, and changes in expression levels in the clinical course provideddetailed information on each of the individual malignancies. In addition to mutations in knowngenes (e.g., TP53, CCND1, NOTCH1, ATM), we identified others, not linked to MCL, such as anonsense mutation in SPEN and an MYD88 missense mutation in one patient, which alongwith copy number alterations exhibited a molecular resemblance to splenic marginal zone lym-phoma. The detailed exonic and transcriptomic portraits of the individual MCL patientsobtained by the methodology presented here could help in diagnostics, surveillance, and poten-tially more precise usage of therapeutic drugs by efficient screening of biomarkers. © 2020Published by Elsevier Inc. on behalf of ISEH – Society for Hematology and Stem Cells.

Mantle cell lymphoma (MCL) is a B cell non-Hodgkinlymphoma in which translocation t(11;14)(q13;q32),transposing the cell cycle regulator cyclin D1 (CCND1)(11q13) to transcriptional control of the immunoglobu-lin heavy chain (IGH) locus (14q32), is considered to

be the primary oncogenic event [1]. This driver translo-cation leads to the constitutive overexpression ofCCND1 and cell cycle deregulation [2]. Yet, patientslacking the CCND1 signature, but with CCND2 overex-pression, have been identified in several studies [3,4],

MHH and OC contributed equally to this work.

Offprint requests to: Charlotte Guldborg Nyvold, Haematology−Pathology Research Laboratory, J. B. Winsløws Vej 15, 3rd floor,

Building 240, 5000 Odense C, Odense University Hospital, Denmark;

E-mail: [email protected]

Supplementary material associated with this article can be found in theonline version at https://doi.org/10.1016/j.exphem.2020.03.001.

0301-472X/© 2020 Published by Elsevier Inc. on behalf of ISEH – Society for Hematology and Stem Cells.

https://doi.org/10.1016/j.exphem.2020.03.001

Experimental Hematology 2020;84:7−18

133

and likewise, patients with constitutive CCND3 expres-sion have been recognized [5]. In addition to the chro-mosomal rearrangement of the cyclin D locus, otheraberrations, including mutations in the genes TP53 andATM, have been found to be highly involved in thedevelopment of the malignancy [6]. However, theseaberrations or recurrent combinations only occur in asubset of the patients, and thus large sequencing-basedassociation studies cannot portray the complete oncobi-ology underlying the lymphomas.

Two major subgroups of MCL are currently acknowl-edged with distinct prognoses: (1) the classic nodal MCLwith expression of the transcription factor SOX11 andabsence of somatic hypermutated, or minimally mutated,immunoglobulin V genes (IGHV); and (2) the nonnodalMCL subtype without SOX11 expression and mutatedIGHV, with the latter having the superior prognosis [7]. Inboth cases, the survival of MCL patients is overall poor,especially for patients resistant to frontline drugs. Despitethe frequent remissions observed in patients, relapses oftenoccur with disseminated lymphoma, high proliferation index,and refractoriness to treatment [7,8]. This clinical andmolecular heterogeneity of MCL impedes the discovery ofprognostic markers or drug targets by means of associationstudies, and the observed diversity calls for precision geno-mics. Many of the next-generation sequencing studies inother areas of hematology have included large cohorts [9−13] and have been instrumental in identifying driver muta-tions involved in oncogenesis and progression. Despite this,it is now clear that the different somatic lesions, that is,somatic mutations and copy number alterations, exhibithighly complex combinations even in more homogeneousmalignancies. Although a few studies using exome or RNAsequencing have shed some light on molecular constituentsof MCL [9,10,14−16], for example, by identifying recurrentmutations such as in NOTCH1, these are often found in asmall fraction of the patients, underpinning the heterogeneityand emphasizing the need for biological characterization ofthe individual tumor and its clonal progression. Striving forpotential treatment targets, there is an urgent need to charac-terize the striking diversity in MCL, not alone to gain moreprecise molecular handles for differential diagnosis of oftenhighly similar B cell malignancies as chronic lymphocyticleukemia, MCL, and splenic marginal zone lymphoma(SMZL). We are now in a position where the accumulatedknowledge from the larger studies and databases can andshould be used for more precise molecular diagnostics ofthe patients. Individual cases characterized thoroughly withcontemporary sequencing methods can still add importantknowledge to the field. Moreover, the genomics underlyingthe longitudinal progression in MCL has, to our knowledge,been addressed only in two studies involving comprehensivesequencing: by exome sequencing in a single study compris-ing 13 patients [10], and in a study including both exomeand transcriptome sequencing in a single patient [17].

Turning to technical aspects of such data-driven molecu-lar investigations, studies have revealed inconsistenciesbetween variant detection of whole-exome sequencing(WES) and RNA sequencing [18−20], while the combina-tion of WES and transcriptome sequencing has exhibitedhigher specificity [21,22]. Furthermore, DNA or RNAsequencings are often performed from bulk tissue samples,which may be a mosaic of malignant and non-malignantcells. Because of the low allele frequencies of variants inthese biological heterogeneous samples, the discriminationbetween a mutation and the background noise can be verydifficult.

In the study described here, we evaluated four patientsclinically diagnosed with MCL from initial sampling to dis-ease progression. This was performed to retrospectivelyinvestigate the diagnosis based on a molecular foundation,with the aim of helping future differential diagnostic proce-dures and for more efficient discovery of novel biomarkersin MCL. We also aimed to characterize the individual clonalprogression based on the molecular clues derived fromthe presented layout, although focusing on a few patients.We implemented an integrative strategy, a molecular dissec-tion of pure tumor cells, which combines cell sorting andrigorous sequencing involving both DNA and mRNA(Figure 1)—where the latter plays an important role in qual-ity assessment of the cell sorting and sample material. Fromthis strategy, sorted cells were used to circumvent poten-tially low allele frequency of somatic coding variants andinterference of non-malignant RNA expression, and theexome and transcriptome sequences were intersected to min-imize false-positive mutations. This provided a clear pictureof possible causal driver mutations and expression patterns;yet, the outlined strategy also mediated resolution of chro-mosomal copy number alteration (CNA), clonal information,and distinctly altered expression profiles between diagnosisand relapse on the basis of next-generation sequencing(NGS) modalities.

Methods

Patient cohorts, sample material, and cytogeneticsFor the exome and transcriptome sequencing, mononuclearcells (MNCs) from four patients diagnosed with MCL at Aar-hus University Hospital (AUH), Denmark, from 2005 to 2014were separated by Ficoll gradient centrifugation from bloodor bone marrow at diagnosis (n = 4) and at clinical relapse/progression (n = 5) and stored either in mRNA lysis buffer(Roche Diagnostics GmbH, Mannheim, Germany) (n = 2;>80% and 92% clonal cells by flow cytometry) or indimethyl sulfoxide (DMSO) in liquid nitrogen (n = 7) for sub-sequent cell isolation (Figure 2). B and control T cells wereisolated from the same tissue type for each patient. ControlRNA was derived from CD19+ B cells from two individualhealthy donors (iXCells Biotechnologies USA, San Diego,CA) and pooled healthy donor CD19+ B cells (Miltenyi Bio-tec, Bergisch Gladbach, Germany) from peripheral blood.

8 M.H. Hansen et al. / Experimental Hematology 2020;84:7−18

134

Chromosome analyses were done at AUH on unstimulatedovernight cultures of bone marrow aspirates. Interphasenuclei fluorescence in situ hybridization analyses were con-ducted with the following locus-specific directly labeledprobes: IGH, CCND1, TP53, D12Z3, D13S319 , and ATM(Abbott Molecular, Wiesbaden, Germany). Clinical data fromthe patients are summarized in Figure 2. All patients hadbeen treated according to clinical MCL protocols.

Cell sortingA minimum of 5 million cells from cryopreserved MNCs werestained with conjugated antibodies for CD19 (clone SJ25C1, BDHorizon, BD Biosciences, San Jose, CA), CD3 (clone SK7, BDBiosciences), light chain k or λ (polyclonal rabbit, Dako, Glostrup,Denmark), CD20 (clone 2H7, eBioscience, San Diego, CA), CD5(clone L17F12, BD Biosciences), FMC7 (BD Biosciences), and 7-aminoactinomycin D (7AAD) (BD Pharmingen, BD Biosciences).The 7AAD−CD3−CD19+light chain+CD5+CD20+FMC7+ cells(Supplementary Figure E1, online only, available at www.exphem.org) were sorted on a FACS ARIA III (BD, Franklin Lakes, NJ)with purity estimated for each isolated population (Figure 2).7AAD−CD3+ T cells were sorted from the same patients and usedas paired non-malignant control samples (Figures 1 and 2; Supple-mentary Figure E1).

DNA and RNA extractionDNA and RNA from cells in lysis buffer (mRNA lysis buffer,Roche, Basel, Switzerland) were extracted using the MagNA LCDNA isolation kit (Roche) and the MagNA pure LC RNA HighPerformance isolation kit (Roche), respectively, while DNA andRNA from sorted cells were extracted with the AllPrep DNA/RNA micro kit (Qiagen, Hilden, Germany).

Exome and RNA sequencingOne hundred nanograms of DNA from each of the ninemalignant B cell populations and the four paired CD3+ T cellpopulations (Figure 1) was used in exome sequencing. Sam-ple preparation was based on the NimbleGen SeqCap EZMedExome kit (Roche), and the library was sequenced onNextSeq550 (Illumina, San Diego, CA) with paired end reads(2£ 78 bp). Transcriptome mRNA sequencing, aimed at min-imum 66 million paired-end reads per sample (2£ 75 bp),was prepared with the Illumina TruSeq RNA Access kit from10 ng of total RNA. In total, 12 samples (9 MCL samplesand 3 CD19+ samples from healthy donors as controls) wereRNA sequenced on the NextSeq platform (Figure 1). Rawdata were quality checked (FastQC, Babraham Institute,Cambridge, UK) and aligned to the human reference genome(GRCh37) in the CLC Biomedical Workbench (Qiagen,

Figure 1. Workflow overview. Intotal, 25 samples from highly pure

cell populations were exome or

RNA sequenced (A). The down-

stream sequencing analysis utilizeddouble control-paired exome and

RNA intersection (B) to profile the

molecular intra- and interpatient dif-

ferences between MCL diagnosisand relapse/progression by means of

expressed somatic mutations, copy

number changes, similarities, andchanges in expression patterns (C).

M.H. Hansen et al. / Experimental Hematology 2020;84:7−18 9

135

Aarhus, Denmark). Alignment of exome and RNA sequenc-ing data was performed in parallel with the Burrows WheelerAligner (BWA, GRCh37 assembly) [23] and Tophat2/Bow-tie2 [24] for downstream crosscheck of somatic variants andquality control using GATK 3.6 and MuTect [25], with man-ual inspection in the Integrated Genomics Viewer (BroadInstitute). Detection of expressed somatic mutations, usingcontrol-paired exome−transcriptome intersection, was per-formed in the CLC Biomedical Workbench [21]. Single-nucleotide variant allele frequencies (VAFs) from MuTectwere used to assess clonal patterns [26].

Expression analysis involved statistical differences in groupmeans and patient-specific quantitative changes between diagno-sis and relapse. Patient samples and normal B cell controls werecompared by groupwise comparison as means of RNA qualitycontrol, sample purity assessment, and biomarker discovery. Nor-malization of expression values was based on scaling relative tosample mean [27], and altered gene expression was analyzedusing the uncorrected Smyth−Robinson exact test well suited forsmall sample sizes [28], excluding overrepresented immunoglob-ulin genes, and presented as heat map for all differentiallyexpressed genes. The groupwise analysis was performed with aminimum of a twofold expression difference and 10 reads perkilobase exon per million mapped reads (RPKM) difference from

the collective sample means. The results were not Bonferronicorrected because of the small sample size. Similarities betweenRNA expression signatures were assessed by the t-distributed sto-chastic neighbor embedding technique (t-SNE). Large shifts inexpression pattern between diagnosis and relapse (>100 RPKM,normalized to control B cell expression, p < 0.01) were evaluatedusing functional gene set enrichment analysis (GSEA, BroadInstitute, Cambridge, MA), and the most extremely up- or down-regulated genes (n = 100) sorted by absolute fold change andexpression values between diagnosis and relapse were evaluatedfor change in specific signaling pathways inferred statistically bythe hypergeometric test (Hallmark gene sets, Molecular Signa-tures Database, GSEA).

Whether the mutational profiles of diagnostic or relapsesamples caused gene expression of the implicated genes tobe shut down (RKPM <1) or transcribed (RKPM ≥1) wasassessed with Pearson’s x2 test for nominal values, combin-ing the somatic single-nucleotide coding variant (SNV) out-put from MuTect [25] and the normalized expression valuesfrom RNA sequencing in relation to control donor B cellsamples. Likewise, this was assessed for the individual diag-nosis−relapse samples alone using McNemar’s test for pairedsamples with dichotomous outcome, that is, expressed versusnonexpressed.

Figure 2. Diagnostic, cytogenetic, and sorting characteristics for the samples used in the study. Ara-C=cytarabine; CHOP=cyclophosphamide/

doxorubicin/vincristine/prednisolone; IHC=immunohistochemistry. BM=bone marrow; PB=peripheral blood; R=rituximab; PET=positron emis-

sion tomography; RFC=rituximab/fludarabine/cyclophosphamide; SCT=stem cell transplantation. *Purity of clonal B cells (CD19+/IGL) in MNC

obtained after a Ficoll gradient centrifugation. £Data are from PB (cells are sorted from BM in this patient). $From CD19+/CD5+. #Cytogeneticsperformed 32 months before this sample. ##Cytogenetics performed 2 months after this sample. ###Cytogenetics performed 1 month before this

sample. Blanks (—) indicate information was not available.

10 M.H. Hansen et al. / Experimental Hematology 2020;84:7−18

136

Quantitative PCRSelected differentially expressed genes (n = 12) from RNAsequencing were confirmed by correlation to quantitativepolymerase chain reaction (PCR) expression values of therespective patient and control B cell samples. The targettranscripts were reversed transcribed using the SuperscriptVILO cDNA Synthesis Kit (Invitrogen) and amplified byPCR using 300 nmol/L primer and 200 nmol/L probe forGUSB or 1£ TaqMan Gene Expression Assay otherwise(Life Technologies, Carlsbad, CA). Normalization was per-formed using the DCt method with GUSB as reference gene.Transcript expression for each gene was then correlated toRNA sequencing (Spearman’s rank correlation coefficient).DCt expression values were normalized to their median val-ues because of differences in sensitivity and range and scaledto the representative range of RNA sequencing RPKM forcomparison (two orders of magnitude). Biomarker candidates(n = 12) from differentially expressed genes were furtherinvestigated in a larger independent cohort of MCL and CLLpatients (n = 40) and non-malignant control subsets of periph-eral blood mononuclear cells (PBMCs) and CD19+ cells(n = 20) isolated from PBMC using the human CD19microbeads (Miltenyi Biotec). Statistical differences betweenthe qPCR cohorts were evaluated with the Kruskal−Wallistest using Dunn’s method for multiple comparisons.

Chromosomal copy number variation was uncovered bypaired comparisons of read depth (VarScan 2 [29]) based onBWA alignments and significant change in allele frequenciesfrom GATK output by Fisher’s exact test (p < 0.001).Results were compared with cytogenetics, when available.

Ethical considerationsThe project was approved by the National Committee onHealth Research Ethics, Denmark (Approval No. 1605184),and data were handled in agreement with requirements of theDanish Data Protection Authority.

ResultsThe usage of relapse and control paired sorted periph-eral blood or bone marrow samples in this study wasbased on the previous observation that peripheral bloodand bone marrow expression profiles of MCL signaturegenes are not expected to be significantly different[30,31]. Mantle cell lymphomas with leukemic presen-tation, a common feature [32,33], were selected on thebasis of the criteria of practical, homogenous samplingat different time points to compare molecular profileswith those of circulating B-cells and other malignan-cies, such as chronic lymphocytic leukemia (CLL).Malignant B-cells and non-malignant T-, representingpaired samples for each patient, were sorted accordingto a rigorous gating strategy (Supplementary FigureE1) with the sorted cells having mean purities of98.9% and 98.7% for MCL and T cell controls, respec-tively (Figure 2). Downstream sequencing analyseswere based on data with a mean number of sequencingreads of 164£ 106 with 99.3% mapping efficiency for

WES and 96£ 106 reads with 93.1% mapping effi-ciency for RNA sequencing.

Assessment of RNA expressionGroupwise comparison of patient samples and B cellcontrols, as means of RNA quality control, assessmentof sample purity, and discovery of potential biomarkercandidates, revealed two distinct expression profiles(Figure 3A,B). The expression profiles of diagnosis andrelapse of each patient were more similar than those ofthe respective samples compared with B cell controls(Figure 3A). Thus, all malignant samples were evalu-ated against the control cell samples. Also, the diagno-sis−relapse pairs exhibited a consistent degree ofrelative difference, whether drawn from peripheralblood or bone marrow, with the exception of a secondrelapse sample (Figure 3A). This coding transcriptomeanalysis, defined by a threshold of minimum twofoldexpression difference and at least 10-RPKM differencefrom the collective sample means, resulted in 132 aber-rantly expressed genes (Figure 3B, p < 0.01, not Bon-ferroni corrected because of small sample size), withthe majority being downregulated (113/132). Explicitlymphoma and lymphocyte signature genes weredetected among the 19 significantly upregulated genesin the MCL group, such as CCND1, CD5, CD27,CD1C and BLNK, with a highly variable expression ofupregulated transcripts. Upregulation of CCND1 wascongruent with laboratory diagnostics. The patientsoverexpressed CCND1 with the exception of patient 4,in agreement with the confirmed CCND1 translocationt(11;14) negativity revealed by cytogenetics performedon blood at diagnosis and progression. In contrast, pre-vious clinical analyses involving immunohistochemistry(IHC) were ambiguous, with CCND1-positive bonemarrow and negative peripheral blood (Figure 2).SOX11 was found to be negative in patient 4, at bothdiagnosis and progression, by IHC or qPCR.

Twelve of the 19 significantly upregulated genes(SYNE2, ST14, NAPSB, LILRA4, ACP5, CCDC50, CD1C,PTPRJ, TRANK1, BLNK, MAP4K1, SMC6 ), based mainlyon their potential association or known association withB cell signaling or B cell malignancy [34−39], were corre-lated with qPCR assays for the same 12 samples subjectedto RNA sequencing for validation (9 MCL samples and 3CD19+ cell samples from healthy individuals). Thesequencing expression values (RPKM) from these geneswere all positively correlated with qPCR expression valuesto a modest to highly significant degree (p < 0.05, Supple-mentary Figure E2, online only, available at www.exphem.org), with a median rSpearman of 0.83 (0.63−0.95) and avariable dynamic range for each qPCR assay. The collec-tive correlation of qPCR versus RNA sequencing revealedacceptable congruence using qPCR values normalized torespective assay medians (rspearman = 0.75, p < 10−4;

M.H. Hansen et al. / Experimental Hematology 2020;84:7−18 11

137

Figure 3C). As the RNA expression analysis potentiallypointed to novel biomarker candidates, a subset of the sig-nificantly overexpressed genes were evaluated in a largercohort of MCL and differential diagnosis CLL and twocontrol cohorts (Supplementary Figure E3, online only,

available at www.exphem.org). Both SYNE2 and PTPRJwere differentially upregulated in the MCL and CLLcohorts compared with the CD19+ controls (Figure 3D,pKruskal−Wallis < 0.01, adjusted for multiple comparisons),while differential expression in comparison to PBMC

Figure 3. Differentially up- anddownregulated genes in the

MCL samples. The differential

expression of genes between

malignant B-cells (diagnosis andrelapse) and CD19+ B-cells

from healthy donors, illustrated

in a t-SNE plot (A), was defined

by a threshold of a minimumtwofold expression difference

and at least 10 RPKM differen-

ces of the group means. Thisdifferential expression of genes

resulted in 19 upregulated genes

(p < 0.01, Robinson−Smyth

exact test for small samplesizes) represented in the heat

map (B). This heat map illus-

trates the relative difference

from the collective samplemeans (black), being either

upregulated (red, saturated at

the upper quartile for visual

contrast) or downregulated(green, saturated at the lower

quartile). Twelve genes from the

heatmap were tested for expres-sion correlation using qPCR,

and all were significantly corre-

lated with the results from RNA

sequencing, individually (Sup-plementary Figure E2) and col-

lectively (C), normalized to

each qPCR gene assay median

expression values and scaled tothe RPKM range for compari-

son). SYNE2, PTPRJ, and sev-

eral other genes (SupplementaryFigure E3) were significantly

upregulated in both MCL and

CLL compared with PBMC or

CD19+ cells (D). LILRA4 fol-lowed a bimodal expression dis-

tribution for the MCL and CLL

groups, being either up- or

downregulated, and may dividethe patients in two prognostic

groups for further investiga-

tions. TRANK1 was significantly

upregulated in MCL comparedwith CLL and may provide

another specific marker to dis-

criminate MCL from CLL.Combined, these four genes are

highly specific biomarkers for

the two hematological malig-

nancies.

12 M.H. Hansen et al. / Experimental Hematology 2020;84:7−18

138

subsets was specific to PTPRJ and TRANK1 (pKruskal−Wallis

< 0.0001). In contrast to the other marker candidates,LILRA4 exhibited a distinct bimodal expression distribu-tion, within both the MCL (pKolmorogov−Smirnov < 0.01) andCLL (pKolmorogov−Smirnov = 0.02) groups, compared withthe control sets. In combination, LILRA4, SYNE2, PTPRJ,and TRANK1 expression profiles were specific for thehematological malignancies compared with the controlsamples, advocating for the implementation of a naiveBayes classifier or other machine learning techniques.

Transcribed somatic mutationsThe observed mean depth of coverage for somaticmutations, present in both exome and mRNA, was 150.This WES−RNA pairing resulted in a highly specificvariant set, with a median of 17 (7−24) expressedsomatic SNVs and indels per sample (SupplementaryTable E1, online only, available at www.exphem.org).In comparison, the de facto algorithm for SNVs inMuTect [25] produced a median of 187 (120−965)unfiltered candidates, using the T cell subset as normalcontrols. Removal of variants with low depth of cover-age (DP < 20), low-frequency variants (VAF < 10%),and common polymorphisms in the European popula-tion (frequency >1%) led to a median of 32 (12−57)SNVs. The number of overlapping mutations in thequality-filtered somatic SNVs from MuTect and theWES−RNA paired mutations was discouraging (n = 10,e.g., in TP53, IGF2R, IL4R, SPEN, and IQGAP2),underscoring the need for additional measures to con-firm whole-exome sequencing results in general.

The intersected exome and transcriptome data of the fourpatients revealed four distinct somatic profiles, with respectto both transcribed driver candidates and allelic frequencies,such as arising MYD88 p.S219C or lost mutations (e.g., inBCLAF1 and TRAF3; Figure 4, Supplementary Table E1).Plausible drivers were selected on the basis of either director indirect association with B cell survival and lymphomafrom PubMed search (Mesh Terms lymphoma and geneassociation). The systematic search was based on the specif-ically transcribed somatic mutations (7−24). Although thisapproach increases the specificity of the analyses, it is alsopotentially biased toward mutations that do not cause loss ofexpression, for example, nonsense-mediated decay, and maythus fail to detect driving nonsense mutations. However, weobserved no statistical difference in expressed or unex-pressed mutated genes in sorted malignant cells, individualdiagnosis or relapse, or all malignant samples in combina-tion, compared with control donor CD19+ B-cells (p > 0.05,Pearson’s x2 test, 1 RPKM threshold), nor did we observeany difference between diagnosis and relapse (p > 0.05,McNemar’s paired test, 1 RPKM threshold). Collectively,there was no indication that the somatic nonsense mutations,or other point mutations detected by whole-exome sequenc-ing alone, led to an expressional shift of the involved genes

in RNA sequencing. This stability was underscored by thefact that only two nonsynonymously mutated genes werenoticeable by nonstatistical, quantitative assessment:NCAPG (p.E858K, patient 3 diagnosis) and IQGAP2 (p.R169H, patient 3), with the first being marginally expressed(1.9 RPKM vs. 0.3 RPKM) and the latter being decreasedor shut down (<1 RPKM) in both diagnosis and relapsecompared with control B-cells (7 RPKM). All nonsensemutated genes (n = 22) from MuTect were in line with thedonor B cell counterpart, that is, either not expressed (10genes in total) or expressed (12 genes) with no difference inexpression (median of 10 vs. 9 RPKM).

No somatic mutations involving identical positions in thesame genes were shared between the four patients, albeitgenes involved in similar pathways or functions wereaffected. We identified hits potentially involved in DNA-damage response (TP53 and ATM) and intertwined B cellsurvival pathways, for example, through NF-kB (ATM,TRAF3 [40], BCLAF1 [41], MYD88, IL4R [42]), JNK(MAP4K2 [43]) or JAK-STAT (IL4R [44], TRAF3 [45]),and BCL2 association (BCLAF1 [46]), thus accounting forthe majority of the potential expressed driver mutations.Collectively, genes involved in the Notch signaling pathwayor Notch cross-talk (NOTCH1, SPEN [47], CCND1, TP53,and ATM) were affected in all four patients. Loss of hetero-zygosity involving the TP53 mutations (patients 2 and 3,Supplementary Figures E4 and E5, online only, available atwww.exphem.org, and Supplementary Table E1) was sup-ported by large 17p deletions of 12 and 19 megabases (Mb).Furthermore, the shift in NOTCH1 and double-mutatedATM was explained by acquired trisomies of chromosome 9and 11 at relapse (Supplementary Figures E4 and E5).Patient 3 was the only one exhibiting a distinct change inallele frequency distributions based on somatic single-nucle-otide variants between diagnosis and the two relapse sam-ples (Supplementary Figure E6, online only, available atwww.exphem.org) not caused by CNAs, indicating a dis-crete clone at diagnosis and at least two at relapse. Alldetected CNAs (>1 Mbp) were larger than 1 Mb andaffected 2−14 chromosomes for each patient sample. Themost prominent change from diagnosis to relapse progres-sion was the subtle clearance of a deletion on chromosome13, which was accompanied by a change from a chromo-some 12p deletion to a clonal chromosomal 12 gain atrelapse (Supplementary Figures E4 and E5, patient 1).

Expressional changes from diagnosis to progressionThe difference in B cell control normalized expressionbetween diagnosis and relapse revealed broad functionalenrichment (p and FDR q < 0.001, GSEA, Broad Institute,>100 RPKM absolute change) in immune regulatory pro-cesses and signaling, defense response, regulation of celldeath, and chromatin structure (Figure 5; SupplementaryFigure E7, online only, available at www.exphem.org).Most distinct was the frequency of altered expression in

M.H. Hansen et al. / Experimental Hematology 2020;84:7−18 13

139

histone genes (81 occurrences) in Kr€uppel-like factors (11occurrences) and human leukocyte antigens (25 occur-rences). Only two genes in these gene sets, JUN and FOS,were altered by this magnitude in all four patients, with thelatter proto-oncogenes notorious as heterodimeric bindingpartners of the AP-1 transcription factor complex in B cellsignaling and function. There were no systematic observa-tions in the directional change between diagnosis andrelapse, and none of the functional enrichments were signifi-cantly up- or downregulated in comparison to healthy con-trol B-cells. In contrast, non-normalized pathway analysis ofextreme fold and expressional changes (100 most increasedand decreased gene expressions) between individual diagno-sis−relapse pairs revealed a highly significant enrichment ingenes regulated by NF-kB to tumor necrosis factor (14−21of 200 gene members in the set, GSEA M5890, p < 10−3

[0]; see Supplementary Table E2, online only, available atwww.exphem.org) among other hallmark gene sets, whichinclude NOTCH, JAK-STAT, HEDGEHOG, and WNT sig-naling pathway. The NF-kB gene members were eitherdownregulated (Figure 6, patients 1 and 3) or upregulated(patients 2 and 4) and, thus, not directionally specific for therelapse.

DiscussionThe aim of this study was to develop a method of focusedcharacterization, that is, a molecular portrait, of clinicallydiagnosed cases of MCL to improve upfront molecularclassification of diagnosis and progression of individual

cases and to screen more efficiently for novel biomarkersof this highly heterogenous lymphoma. To accomplishthis, we used sorted cell populations from MCL patientscombined with integrated WES and RNA sequencing anal-yses. The intersection of the two methods allowed us todetect somatic mutations with higher precision, comparedwith WES alone on bulk tumor samples, leaving behinduntranscribed mutations and erroneous somatic callsthrough sequencing noise reduction. This observation wasin agreement with a previous study on cytogenetically nor-mal acute myeloid leukemia [21]. One drawback of thismethodology pertains to the potential false-negative muta-tions. Thus, all somatic candidates from WES alone wereevaluated in parallel, but the larger number of variants didnot add to the results in terms of probable drivers. Wealso found that the expression status of mutated genes wasnot significantly altered, when compared with the expres-sion profiles of control B-cells. Neither did we observeany change between diagnostic sample and relapse foreach patient in the involved genes. These observationssupported the application of direct DNA and RNA pairing.We relied on peripheral blood or bone marrow as a sourceof malignant cells for consistency and ease of sampling,although we are aware that lymph node biopsies may pres-ent different expression profiles or even different genomiclesions. Such potential differences are not explored here.

To our knowledge, this is the first study describing themolecular progression in MCL using the elaborated combi-nation on highly pure populations of neoplastic cells. Not

Figure 4. Somatic allele fre-quency changes. The small num-

ber of expressed somatic

mutations (7−24) were evalu-

ated against online databases(prominently PubMed, COS-

MIC, dbSNP, and ClinVar) to

locate mutated genes with driver

potential at diagnosis andrelapse. The illustrated muta-

tions with allele frequency

changes are either known classicdriver genes or of direct interest

with respect to association with

B cell pathways.

14 M.H. Hansen et al. / Experimental Hematology 2020;84:7−18

140

unexpectedly, four general molecular signatures of diagnosisand relapse progression were observed: (1) rise and fall ofsingle- and small-nucleotide coding variants; (2) large copynumber alterations, exemplified by both a stable CNA pro-file and complex chromosomal aberrations from diagnosis torelapse; (3) subclonal involvement as indicated by variantallele frequency analysis; and (4) change in expression lev-els at patient relapse compared with its diagnostic counter-part. Despite the heterogeneous mutation profile of theindividual neoplasms, we were able to identify an atypicalcase, clinically diagnosed MCL from morphology and flowcytometry, which on the molecular level was closely related

to a CD5+ SMZL. This assessment was, among other fea-tures, based on its molecular identity with SPEN [48] andMYD88 mutations, discussed in the following.

Each case was marked by a small number of potentialdriver mutations compared with previous studies [10,14],while all four patients had high-burden mutations related toB cell pathways and functions. Confirmation of recurrentmutations in MCL, such as in TP53, NOTCH1, CCND1,and ATM [9,10,14], was accompanied by the identificationof expressed somatic mutations in genes unknown or veryrare in MCL (SPEN, MYD88) and lesser known (TRAF3),all linked to lymphocyte signaling pathways. The mutationin SPEN introduced a stop codon (p.R1170*), and the muta-tion in TRAF3 was located next to a zinc finger domain (p.G123D). The transcriptional repressor SPEN has beenreported to be a negative regulator of the Notch pathway[49] and is therefore of extreme interest in the pathogenesisof lymphoid neoplasms. Along the same lines, the tumorsuppressor TRAF3 is a regulator of B cell survival, andmutations in this gene have previously been found in bothmyeloid malignancies and other lymphomas, such as diffuselarge B cell lymphoma [50,51]. Introduction and loss ofmutations between diagnosis and relapse—MYD88,BCLAF1, and TRAF3—underpinned the evolving clonalprogression of the lymphomas in a steady background ofearly founding mutations, while allele frequency shifts atrelapse also occurred as a result of evident chromosomal

Figure 5. B Cell normalized gene set enrichment analysis. Func-

tional gene set enrichment analysis between diagnosis and relapseusing healthy B cell normalized expression was performed by identi-

fying genes with large expression differences between diagnosis and

relapse (>100 RPKM), with the exception of ribosomal transcripts.The analysis revealed significant overlap with G0 gene sets related

to B cell function, immune response, regulation of cell death, and

DNA structuring (GSEA, Broad Institute). The full list of genes is

available in Supplementary Figure E7.

Figure 6. Quantitative change in central pathway signaling between

diagnosis and relapse. The most extreme gene expression changes

between diagnosis and relapse, that is, the 100 most upregulated and

downregulated, were highly significant for the NF-kB signaling path-way (7.34£ 10!13 ≥ phypergeometric ≥ 1.36£ 10−23). The median

observed absolute fold change value between diagnosis and relapse

was 4 (3−6 for each patient) and the median absolute expression dif-ference was 39 reads (20−75) per kilobases per million sequencing reads,

with two- and threefold changes in patients 1 and 3, respectively. Among 50

Hallmark gene sets (GSEA, Broad Institute), NF-kB pathway member genes

were predominantly increased (marked red) in patients 1 and 3, in contrastto a decrease in patients 2 and 4 (green).

M.H. Hansen et al. / Experimental Hematology 2020;84:7−18 15

141

duplication (ATM, NOTCH1; Figure 4)—or as a possiblesubclonal involvement (Supplementary Figure E6). MYD88mutations, which occur frequently in other lymphomas, arenot recurrent in MCL [9,52]. The specific mutationdescribed here (MYD88 p.S219C) has been reported in atleast 24 other cases of B cell malignancies (COSMICVersion 86), including in SMZL [48].

In these settings, with relatively high depth ofexome coverage for each sample, CNA detection bymeans of read depth correlations was a powerful toolfor clonal surveillance. All paired tumor samples wereclonally related but exhibited new relapse features. Weidentified a chromosome 13 deletion (patient 1) at diag-nosis that cleared at relapse (Supplementary FiguresE4A and E5). In addition, a trisomy of chromosome 12superseded a deletion of the respective chromosome.These findings have the curious implication to unravela prediagnostic clone, expanding after remission to thepoint of relapse. Although this patient exhibited com-plex copy alterations affecting more than half of thechromosomes (14 aberrated chromosomes in total; Sup-plementary Figure E5), another patient was devoid ofapparent CNA relapse progression (patient 4, Supple-mentary Figure E5) with a distinct copy profile of lowcomplexity. This distinct copy profile, which held achromosome 7q31−32 deletion (7 Mb), is rarely foundin MCL [53], but is associated with SMZL [54]. Evenif this case may still diagnostically or pathologically bedefined as MCL, the knowledge gained from such amolecular screening holds potentially relevant informa-tion for the clinical course of the patient. Althoughalso treated with rituximab, SMZLs are often indolent,and response rates to splenectomy are high [55]. Thechromosomal copy status of the other samples gener-ally reflected the recurrent copy number alterations inMCL [56], such as 3q and 8q gains, with the latterbeing associated with poor survival through MYConcogene amplification [57−59].

Turning to RNA sequencing, in addition to demonstratinggenes with known impact on the malignant transformation,such as the known MCL driver CCND1, we observed signif-icantly overexpressed genes for the MCL group that are partof the normal B cell pathway. These included B cell linker,BLNK, and CD27, but also yielded relevant genes ofunknown implications, such as suppression of tumorigenicity14 (ST14) (Figure 3B). The expression levels of the normalB cell counterpart in these genes were highly similar andprovided a reasonable resolution despite the small samplesize. Our focus on the sorted malignant B cell populationsprovided the unique possibility of comparing these purepopulations with the normal lymphocytic counterpart forcharacterization of the tumors and quality control of RNAsequencing. Whereas groupwise differential expression(Figure 3B) confirmed biological and clinical features, suchas lack of CCND1 expression and t(11;14) negativity in

peripheral blood by cytogenetics (patient 4), one of the otheraspects of the expression analysis was to uncover shifts ingene expression from diagnosis to relapse. The gene setenrichment profiles for patient progression revealed thesame theme of biological core functions, such as immuneresponse, regulation of cell death, and DNA structuring,even after expression values were normalized relative to thatof CD19+ control B-cells. Furthermore, it was found thatNF-kB pathway members, also active in normal B-cells,were enriched in a highly significant manner in diagnosis−relapse expressional change, in both directions. While thepotential implication is uncertain, these results collectivelyunderline that the online database resources and extensiveliterature can now be implemented for direct profiling of theindividual malignancy and thereby be included as a part ofprecision medicine strategy, even for the single patient. Oneconcrete example in this study is the large absolute expres-sional change in FOS and JUN, which are components ofthe AP-1 transcription factor, and central to B cell develop-ment and regulation of CCND1. These are also intertwinedwith the NF-kB pathway. Thus, this suggests that the AP-1complex may be a potential target in precision medicinetoward mantle cell lymphoma. As treatment options in Bcell neoplasms include inhibition of central pathway genes,for example, the Bruton’s tyrosine kinase (BTK) inhibitoribrutinib and the phosphatidylinositol-3-kinase (PI3K) inhibi-tor idelalisib [60], a thorough characterization of the affectedsignaling pathways of the malignant cells in the single patientbecomes increasingly pivotal because of molecular overlapand cross-talk. A detailed knowledge of the exonic and tran-scriptomic landscape of the neoplasm could imply an evenmore focused use of inhibitory drugs than presently imple-mented in clinical practice. Patients carrying mutations ingenes involved in the Notch signaling pathways, for example,through SPEN, might directly benefit from treatment withNOTCH1 inhibitors designed toward other subsets of cancers[61]. A huge advantage of the in-depth characterization ofmalignant cell populations regarding altered pathway genesis that a large number of targeted drugs already have beendeveloped [60,62], and in the future, the focus should be ontesting the interaction of these drugs in functional assaysand, when applicable, in early-phase clinical studies.

Further investigation of the differentially overex-pressed genes between MCL and control B-cells fromRNA sequencing revealed several potential biomarkersfor discrimination of MCL and CLL from the normallymphocyte counterparts. Two of these genes, in partic-ular, LILRA4 and SYNE2, divided both involved patientcohorts into distinct entities and may hold specificprognostic or biological information on the individuallymphomas. The significance of these findings remainsunknown, and further samples and studies are requiredto assess the qPCR assays as robust predictive features.LILRA4 is one of the leukocyte immunoglobulin-likereceptors, which are involved in modulation of immune

16 M.H. Hansen et al. / Experimental Hematology 2020;84:7−18

142

responses, whereas SYNE2 is involved in the structuralintegrity of the cell (some evidence points to a role incyclin-dependent kinase inhibition through p21) [63]and is thus posed for a direct role together with theclassic hallmark of MCL, the oncogene cyclin D1. Thecollective results from qPCR, involving the cohort pre-sented here, indicate that the presented methodologyusing pure cell fractions is suited for the discovery ofpotential biomarkers. Future work will elucidate therole of these and other genes included in this study,which in many cases have not been previously investi-gated for their involvement in MCL. None of theinvestigated genes differentiated MCL from CLL.

In summary, we have demonstrated that the intersectedanalysis of genome and transcriptome sequencing usingsorted lymphoma cells directly disclosed driver mutationsthat fortify, adjust, or refute diagnosis. The informationderived from this integrative strategy improved the charac-terization of the underlying lymphoma by several differentmolecular handles suited for analyses of a small number ofpatients. Not only does such streamlining provide a higherresolution of the diagnostic lymphoma profile and progres-sive clone of individual patients, compared with exomesequencing on bulk material, but also reveals immediatedrivers and potential biomarkers specific to MCL. Such“fine-grained” molecular portraits may thus hold direct clini-cal potential because candidate genes are potential targets ofalready developed precision medicine toward specific signal-ing pathways.

AcknowledgmentsWe thank Dorthe Aamose Andersen and Nina Friis Jen-sen (Department of Pathology, Odense University Hos-pital), Jette Møller (Department of Clinical Genetics,Odense University Hospital), and Karin Brændstrup(Department of Hematology, Aarhus University Hospi-tal) for excellent technical assistance. We thank KjellTitlestad for providing healthy donor samples. We aregrateful to The Einar Willumsen Foundation and TheResearch Foundation at Odense University Hospital forhaving supported this project.

References1. Tsujimoto Y, Jaffe E, Cossman J, Gorham J, Nowell PC, Croce

CM. Clustering of breakpoints on chromosome 11 in human B-cell neoplasms with the t(11;14) chromosome translocation.

Nature. 1985;315:340–343.

2. Cheah CY, Seymour JF, Wang ML. Mantle cell lymphoma.J Clin Oncol. 2016;34:1256–1269.

3. Salaverria I, Royo C, Carvajal-Cuenca A, et al. CCND2 rear-

rangements are the most frequent genetic events in cyclin D1(−)

mantle cell lymphoma. Blood. 2013;121:1394–1402.4. Kurita D, Takeuchi K, Kobayashi S, et al. A cyclin D1-negative

mantle cell lymphoma with an IGL-CCND2 translocation that

relapsed with blastoid morphology and aggressive clinical

behavior. Virchows Arch. 2016;469:471–476.

5. Royo C, Salaverria I, Hartmann EM, Rosenwald A, Campo E,

Bea S. The complex landscape of genetic alterations in mantlecell lymphoma. Semin Cancer Biol. 2011;21:322–334.

6. Ahmed M, Zhang L, Nomie K, Lam L, Wang M. Gene muta-

tions and actionable genetic lesions in mantle cell lymphoma.

Oncotarget. 2016;7:58638–58648.7. Swerdlow SH, Campo E, Pileri SA, et al. The 2016 revision of

the World Health Organization classification of lymphoid neo-

plasms. Blood. 2016;127:2375–2390.8. Sandoval-Sus JD, Sotomayor EM, Shah BD. Mantle cell lymphoma:

contemporary diagnostic and treatment perspectives in the age of per-

sonalized medicine. Hematol Oncol Stem Cell Ther. 2017;10:99–115.

9. Bea S, Valdes-Mas R, Navarro A, et al. Landscape of somaticmutations and clonal evolution in mantle cell lymphoma. Proc

Natl Acad Sci USA. 2013;110:18250–18255.

10. Wu C, de Miranda NF, Chen L, et al. Genetic heterogeneity in

primary and relapsed mantle cell lymphomas: impact of recur-rent CARD11 mutations. Oncotarget. 2016;7:38180–38190.

11. Ley TJ, Miller C, Ding L, et al. Genomic and epigenomic land-

scapes of adult de novo acute myeloid leukemia. N Engl J Med.2013;368:2059–2074.

12. Faber ZJ, Chen X, Gedman AL, et al. The genomic landscape of core-

binding factor acute myeloid leukemias. Nat Genet. 2016;48:1551–1556.

13. Novak AJ, Asmann YW, Maurer MJ, et al. Whole-exome analy-sis reveals novel somatic genomic alterations associated with

outcome in immunochemotherapy-treated diffuse large B-cell

lymphoma. Blood Cancer J. 2015;5:e346.

14. Zhang J, Jima D, Moffitt AB, et al. The genomic landscape ofmantle cell lymphoma is related to the epigenetically determined

chromatin state of normal B cells. Blood. 2014;123:2988–2996.

15. Rosenwald A, Wright G, Wiestner A, et al. The proliferation

gene expression signature is a quantitative integrator of onco-genic events that predicts survival in mantle cell lymphoma.

Cancer Cell. 2003;3:185–197.

16. Kridel R, Meissner B, Rogic S, et al. Whole transcriptomesequencing reveals recurrent NOTCH1 mutations in mantle cell

lymphoma. Blood. 2012;119:1963–1971.

17. Chiron D, Di Liberto M, Martin P, et al. Cell-cycle reprogram-

ming for PI3K inhibition overrides a relapse-specific C481SBTK mutation revealed by longitudinal functional genomics in

mantle cell lymphoma. Cancer Discov. 2014;4:1022–1035.

18. O’Rawe J, Jiang T, Sun G, et al. Low concordance of multiple

variant-calling pipelines: practical implications for exome andgenome sequencing. Genome Med. 2013;5:28.

19. O’Brien TD, Jia P, Xia J, et al. Inconsistency and features of

single nucleotide variants detected in whole exome sequencingversus transcriptome sequencing: a case study in lung cancer.

Methods. 2015;83:118–127.

20. Piskol R, Ramaswami G, Li JB. Reliable identification of genomic

variants from RNA-seq data. Am J Hum Genet. 2013;93:641–651.21. Hansen MC, Herborg LL, Hansen M, Roug AS, Hokland P.

Combination of RNA- and exome sequencing: increasing speci-

ficity for identification of somatic point mutations and indels in

acute leukaemia. Leuk Res. 2016;51:27–31.22. Wilkerson MD, Cabanski CR, Sun W, et al. Integrated RNA and

DNA sequencing improves mutation detection in low purity

tumors. Nucleic Acids Res. 2014;42:e107.23. Li H, Durbin R. Fast and accurate short read alignment with Burrows

−Wheeler transform. Bioinformatics. 2009;25:1754–1760.

24. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL.

TopHat2: accurate alignment of transcriptomes in the presence ofinsertions, deletions and gene fusions. Genome Biol. 2013;14:R36.

25. Cibulskis K, Lawrence MS, Carter SL, et al. Sensitive detection

of somatic point mutations in impure and heterogeneous cancer

samples. Nat Biotechnol. 2013;31:213–219.

M.H. Hansen et al. / Experimental Hematology 2020;84:7−18 17

143

26. Hansen MC, Nederby L, Kjeldsen E, Petersen MA, Ommen HB,

Hokland P. Case report: exome sequencing identifies T-ALLwith myeloid features as a IKZF1-struck early precursor T-cell

malignancy. Leuk Res Rep. 2018;9:1–4.

27. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of

normalization methods for high density oligonucleotide array databased on variance and bias. Bioinformatics. 2003;19:185–193.

28. Robinson MD, Smyth GK. Small-sample estimation of negative

binomial dispersion, with applications to SAGE data. Biostatis-tics. 2008;9:321–332.

29. Koboldt DC, Zhang Q, Larson DE, et al. VarScan 2: somatic

mutation and copy number alteration discovery in cancer by

exome sequencing. Genome Res. 2012;22:568–576.30. Simonsen AT, Schou M, Sorensen CD, Bentzen HH, Nyvold

CG. SOX11, CCND1, BCL1/IgH and IgH-VDJ: a battle of mini-

mal residual disease markers in mantle cell lymphoma? Leuk

Lymphoma. 2015;56:2724–2727.31. Hamborg KH, Bentzen HH, Grubach L, Hokland P, Nyvold CG.

A highly sensitive and specific qPCR assay for quantification of

the biomarker SOX11 in mantle cell lymphoma. Eur J Haematol.2012;89:385–394.

32. Cohen PL, Kurtin PJ, Donovan KA, Hanson CA. Bone marrow

and peripheral blood involvement in mantle cell lymphoma. Br J

Haematol. 1998;101:302–310.33. Ferrer A, Salaverria I, Bosch F, et al. Leukemic involvement is a com-

mon feature in mantle cell lymphoma. Cancer. 2007;109:2473–2480.

34. Liou J, Kiefer F, Dang A, et al. HPK1 is activated by lympho-

cyte antigen receptors and negatively regulates AP-1. Immunity.2000;12:399–408.

35. Farfsing A, Engel F, Seiffert M, et al. Gene knockdown studies

revealed CCDC50 as a candidate gene in mantle cell lymphoma and

chronic lymphocytic leukemia. Leukemia. 2009;23:2018–2026.36. Dong HY, Shahsafaei A, Dorfman DM. CD148 and CD27 are

expressed in B cell lymphomas derived from both memory and

naive B cells. Leuk Lymphoma. 2002;43:1855–1858.37. Delia D, Cattoretti G, Polli N, et al. CD1c but neither CD1a nor

CD1b molecules are expressed on normal, activated, and malig-

nant human B cells: identification of a new B-cell subset. Blood.

1988;72:241–247.38. Akhter A, Street L, Ghosh S, et al. Concomitant high expression

of Toll-like receptor (TLR) and B-cell receptor (BCR) signalling

molecules has clinical implications in mantle cell lymphoma.

Hematol Oncol. 2017;35:79–86.39. Filarsky K, Garding A, Becker N, et al. Kruppel-like factor 4

(KLF4) inactivation in chronic lymphocytic leukemia correlates

with promoter DNA-methylation and can be reversed by inhibi-tion of NOTCH signaling. Haematologica. 2016;101:e249–e253.

40. Hacker H, Redecke V, Blagoev B, et al. Specificity in Toll-like

receptor signalling through distinct effector functions of TRAF3

and TRAF6. Nature. 2006;439:204–207.41. Shao AW, Sun H, Geng Y, et al. Bclaf1 is an important NF-kap-

paB signaling transducer and C/EBPbeta regulator in DNA dam-

age-induced senescence. Cell Death Differ. 2016;23:865–875.

42. Nappo G, Handle F, Santer FR, et al. The immunosuppressivecytokine interleukin-4 increases the clonogenic potential of pros-

tate stem-like cells by activation of STAT6 signalling. Oncogen-

esis. 2017;6:e342.43. Pombo CM, Kehrl JH, Sanchez I, et al. Activation of the SAPK path-

way by the human STE20 homologue germinal centre kinase. Nature.

1995;377:750–754.

44. Vigano E, Gunawardana J, Mottok A, et al. Somatic IL4R mutations inprimary mediastinal large B-cell lymphoma lead to constitutive JAK-

STAT signaling activation. Blood. 2018;131:2036–2046.

45. Muro I, Fang G, Gardella KA, Mahajan IM, Wright CW. The

TRAF3 adaptor protein drives proliferation of anaplastic largecell lymphoma cells by regulating multiple signaling pathways.

Cell Cycle. 2014;13:1918–1927.

46. Horiuchi K, Kawamura T, Iwanari H, et al. Identification of

Wilms’ tumor 1-associating protein complex and its role inalternative splicing and the cell cycle. J Biol Chem. 2013;288:

33292–33302.

47. Oswald F, Kostezka U, Astrahantseff K, et al. SHARP is a novelcomponent of the Notch/RBP-Jkappa signalling pathway. EMBO

J. 2002;21:5417–5426.

48. Martinez N, Almaraz C, Vaque JP, et al. Whole-exome sequencing in

splenic marginal zone lymphoma reveals mutations in genes involvedin marginal zone differentiation. Leukemia. 2014;28:1334–1340.

49. VanderWielen BD, Yuan Z, Friedmann DR, Kovall RA. Tran-

scriptional repression in the Notch pathway: thermodynamic

characterization of CSL-MINT (Msx2-interacting nuclear targetprotein) complexes. J Biol Chem. 2011;286:14892–14902.

50. San Miguel JF. Introduction to a series of reviews on multiple

myeloma. Blood. 2015;125:3039–3040.51. Bushell KR, Kim Y, Chan FC, et al. Genetic inactivation of

TRAF3 in canine and human B-cell lymphoma. Blood. 2015;

125:999–1005.

52. Ondrejka SL, Lin JJ, Warden DW, Durkin L, Cook JR, Hsi ED.MYD88 L265P somatic mutation: its usefulness in the differen-

tial diagnosis of bone marrow involvement by B-cell lymphopro-

liferative disorders. Am J Clin Pathol. 2013;140:387–394.

53. Clipson A, Wang M, de Leval L, et al. KLF2 mutation is themost frequent somatic change in splenic marginal zone lym-

phoma and identifies a subset with distinct genotype. Leukemia.

2015;29:1177–1185.

54. Mateo M, Mollejo M, Villuendas R, et al. 7q31-32 allelic loss isa frequent finding in splenic marginal zone lymphoma. Am J

Pathol. 1999;154:1583–1589.

55. Bennett M, Schechter GP. Treatment of splenic marginal zone lymphoma:splenectomy versus rituximab. Semin Hematol. 2010;47:143–147.

56. Bea S, Salaverria I, Armengol L, et al. Uniparental disomies,

homozygous deletions, amplifications, and target genes in mantle

cell lymphoma revealed by integrative high-resolution whole-genome profiling. Blood. 2009;113:3059–3069.

57. Setoodeh R, Schwartz S, Papenhausen P, et al. Double-hit man-

tle cell lymphoma with MYC gene rearrangement or amplifica-

tion: a report of four cases and review of the literature. Int JClin Exp Pathol. 2013;6:155–167.

58. Aukema SM, Siebert R, Schuuring E, et al. Double-hit B-cell

lymphomas. Blood. 2011;117:2319–2331.59. Yi S, Zou D, Li C, et al. High incidence of MYC and BCL2

abnormalities in mantle cell lymphoma, although only MYC

abnormality predicts poor survival. Oncotarget. 2015;6:42362–

42371.60. Jerkeman M, Hallek M, Dreyling M, Thieblemont C, Kimby E,

Staudt L. Targeting of B-cell receptor signalling in B-cell malig-

nancies. J Intern Med. 2017;282:415–428.

61. Sanchez-Martin M, Ambesi-Impiombato A, Qin Y, et al. Syner-gistic antileukemic therapies in NOTCH1-induced T-ALL. Proc

Natl Acad Sci USA. 2017;114:2006–2011.

62. Smolewski P, Witkowska M, Robak T. Treatment options formantle cell lymphoma. Expert Opin Pharmacother. 2015;16:

2497–2507.

63. Han C, Liao X, Qin W, et al. EGFR and SYNE2 are associated

with p21 expression and SYNE2 variants predict post-operativeclinical outcomes in HBV-related hepatocellular carcinoma. Sci

Rep. 2016;6:31237.

18 M.H. Hansen et al. / Experimental Hematology 2020;84:7−18

144

Paper III

145

SoftwareX 11 (2020) 100503

Contents lists available at ScienceDirect

SoftwareXjournal homepage: www.elsevier.com/locate/softx

Original software publication

CNAplot — Software for visual inspection of chromosomal copynumber alteration in cancer using juxtaposed sequencing read depthratios and variant allele frequenciesMarcus H. Hansen a,⇤, Peter Hokland b, Charlotte G. Nyvold a

a Hematology-Pathology Research Laboratory, Department of Hematology, Department of Clinical Research, Odense University Hospital andUniversity of Southern Denmark, Odense, Denmarkb Department of Clinical Medicine, Aarhus University, Aarhus, Denmark

a r t i c l e i n f o

Article history:Received 9 March 2020Received in revised form 23 April 2020Accepted 4 May 2020

Keywords:Copy number alterationCopy number variationWhole exome sequencingNext generation sequencingCopy neutral loss of heterozygosity

a b s t r a c t

Cancer genomes are frequently struck by chromosomal lesions. Next generation sequencing providespotentially high-resolution assays in a single analysis to detect copy number alterations (CNA). Theimplementation and usage of sequencing may partly be hampered by the lack of transparency in copynumber calling algorithms. Here, we present the software CNAplot for aligned visualization of variantallele frequencies and read depth ratios generated from paired whole exome sequencing samples. Weimplement transparent statistics to evaluate shifts in allele frequencies and sequencing read counts inorder to support a copy number event, to complement other tools and to detect somatic copy-neutralloss of heterozygosity.

© 2020 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license(http://creativecommons.org/licenses/by/4.0/).

Code metadata

Current code version v12Permanent link to code/repository used for this code version https://github.com/ElsevierSoftwareX/SOFTX_2020_52Legal Code License List one of the approved licensesCode versioning system used NoneSoftware code languages, tools, and services used XojoCompilation requirements, operating environments Xojo (Xojo, Houston, TX, USA) and MBS Xojo ChartDirector pluginSupport email for questions [email protected]

Software metadata

Current software version CNAplot 1.2Permanent link to executables of this version doi.org/10.7910/DVN/BWY6SKLegal Software License MITComputing platforms/Operating Systems Linux, OS X, Microsoft WindowsInstallation requirements None. CNAplot does not install any files. Note that on first launch the user

must accept to open the software in Windows 10 and dismiss the warningfrom Windows Defender Smartscreen. Likewise, the user must accept to openthe software in OS X.

Support email for questions [email protected]

⇤ Corresponding author.E-mail address: [email protected] (M.H. Hansen).

1. Motivation and significance

Cancer genomes are frequently marked by several chromo-somal lesions, such as translocations where a part of one chro-mosome attaches to another, acquired copies of whole or partsof chromosomes or chromosomal losses. Specific aberrations are

https://doi.org/10.1016/j.softx.2020.1005032352-7110/© 2020 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

146

2 M.H. Hansen, P. Hokland and C.G. Nyvold / SoftwareX 11 (2020) 100503

often recurrent and hold diagnostic or prognostic value to clini-cians or researchers. Until recently, analyses on these copy num-ber alterations (CNA) have mainly been reserved to the field andlaboratory techniques of cytogenetics. Next generation sequenc-ing (NGS) has made it possible to explore these chromosomalaberrations in a single assay with a very high resolution as com-pared to several former techniques. In spite of the potentiallyvaluable information and practical aspects, which can supportthe laboratory work-up and conclusions of the cytogeneticist orpathologist, the implementation may be hampered by the lack oftransparency in copy number calling algorithms, lack of simplicityor applicability.

Whole exome sequencing (WES) has been utilized in the de-tection of CNA for a decade, since in-solution capture-based as-says entered the scene [1]. Studies show that WES generallyprovides a lower resolution or an increase in noise compared towhole genome sequencing [2,3]. Also, the concordance betweenalgorithms frequently implemented in published research stud-ies, such as XHMM, CoNIFER, and CNVnator, ADTEx, CONTRA,Control-FREEC, EXCAVATOR, ExomeCNV, Varscan2 etc. have beenfound to be low [4–6]. The majority of tools are initiated throughcommand-line interfaces with no graphical interaction, and oftenrequires some basic knowledge of shell scripting or have more orless complicated workflows, such as CNVnator [7]. It is found thatthe few papers that offer a web-based tool are unavailable a fewyears after publication, such as exemplified by VCS (visualizationof CNV or SNP). In either case, the possibility to assess changes inchromosomal copies by altered sequencing read depth and alongwith affected variant allele frequencies is not straight forwardusing the currently available command line tools. This furtheremphasizes the need for practical software with such capabilities.

Here, we present the software CNAplot developed for the pur-pose of plotting variant allele frequencies juxtaposed read depthratios between case sample and control sample from WES witha medium to high depth of coverage, and which is produced inthe same sequencing run. With this tool, the chromosomal copynumber alterations are visualized along with significantly alteredallele frequencies for fast assessment of acquired copy alterations.In addition, nonparametric statistical tests of observed aberrantregions are provided to support copy change or even copy-neutralloss of heterozygosity by absence of altered read depths. Thus,the provided software may aid the researchers, or cytogeneticists,by directly complementing other analyses and by being easy touse directly on variant call formatted files (VCF) and depth ofcoverage of defined regions.

2. Software description

CNAplot takes input generated from whole exome next gen-eration sequencing reads, as illustrated by the example describedbelow. The sample purity and sequencing must be of sufficientquality and coverage. In general, we do not recommend exomesequencing with less than 50 million reads per sample as it maylead to poor resolution. Two types of files are needed for eachof the paired samples: (1) read counts retrieved from providedintervals (BED), such as gene regions as used in the examplehere, and (2) variants in the variant call format (VCF). Followingimport of data, the variants are visually aligned to the positionsof respective calculated read depth ratios (Fig. 1). Significant shiftbetween focus variant allele frequencies (VAF) and control sampleVAF is calculated from Pearson’s �2 (Eqs. (1)–(2)) or, optionally,Fisher’s exact test for samples with lower coverage (not recom-mended). Based on initial testing of multiple sequencing samplesfrom leukemia and lymphoma patients, we evaluated that McNe-mar’s paired test for related sample pairs was not superior, whencompared to �2. Hence, all sample pairs are analytically regardedas independent.

Table 12 ⇥ 2 contingency table of case-control paired allele depths (Dsample,allele).

SamplesCase Normal

Allele A DC,A DN,AB DC,B DN,B

The significance threshold is selected to match the number ofvariants, n, i.e. the number of multiple tests performed and theaccepted threshold of false positives. The following summarizesthe test for a 2 ⇥ 2 contingency table (Table 1, Eqs. (1)–(2)),given one degree of freedom and significance level of ↵ = 10�4

applicable to testing hundreds of variants per chromosome. Here,the test evaluates the number of observed reference and alternateallele (O) given by its depth (D) to the expected (E):

�2Pearson =

nX

i=1

(Oi � Ei)2

Ei> |15.14| (1)

= N(DC,ADN,B � DN,ADC,B)2

(DC,A + DC,B)(DN,A + DN,B)(DC,A + DN,A)(DC,B + DN,B)(2)

pcompound =nY

i=1

p(i) (3)

Generally, we have observed that for high quality sequencingdata a Bonferroni corrected threshold of ↵ = ↵uncorr/n is com-patible with the presented analytical workflow. The compoundprobability (Eq. (3)) is calculated from the longest stretch ofsignificantly altered variants and provides supporting evidence ofa large copy change. As example, the theoretical probability ofcoincidentally observing three successive, independently, alteredvariants, with the proposed ↵, is then 10�12.

Statistical evaluation of difference between binned read countsextracted from specified intervals is performed with Wilcoxonrank-sum test (Mann–Whitney U [8]), thus treating both un-matched and matched samples, i.e. from the same individual,as independent samples. In comparison, the Wilcoxon signed-rank test [9,10] for related case-control samples showed inferiorperformance for CNV detection. The implemented algorithm isthus:

1. Let list be a 1⇥ 2n matrix, where the first half, list(1 to n),is the case with n number of observations, and the secondhalf is the normal control, list(1 + n to 2n), with eachholding the same number of observations. Each elementis the number of reads of a given interval normalized tothe number of total reads for the respective sample. In theprovided example, each row represents the read count fora given gene.

2. Add index column with i1 to i2n to the list3. Sort this list matrix from the lowest to the highest value,

removing tie rows.4. Add a new index column with i01 to i02n to the list5. Then the test value is the sum of the ranks, U = Sum(i0)

for i1 to in or U = Sum(i0) for in+1 to i2n.

The selected number of observations in each bin is expected tobe high, i.e n � 20, thus U is approximately Gaussian [10] and thestandardized rank statistics value is then evaluated as follows, for↵ = 0.05 (Eqs. (4)–(6)):

z = |U � µU

�U| > 1.96 (4)

µU = n2

2(5)

147

M.H. Hansen, P. Hokland and C.G. Nyvold / SoftwareX 11 (2020) 100503 3

Fig. 1. CNAplot input requirements. Two file types are needed for each of the paired samples: (1) read counts in specified intervals (BED), such as gene regions, and(2) variants in the variant call format (VCF). Software executables are available for OS X, Linux and Windows.

�U =r

n2(2n + 1)12

(6)

2.1. Software architecture

CNAplot has been developed in the cross-platform program-ming environment Xojo (Xojo, Houston, TX, USA). Compiled exe-cutables are provided for OS X, Windows 10 and Linux (Ubuntu),with no additional software required to run the tool. The softwareimplements parts of the proprietary chart and graph plottinglibrary, ChartDirector (Advanced Software Engineering Ltd, HongKong) through MBS Xojo ChartDirector plugin (Christian SchmitzSoftware GmbH, Nickenich, DE). The imported data is stored inan in-memory SQLite database created at runtime. Thus, the RAMusage is proportional to number of variants and number of readcounts. Memory usage was 125 MB under OS X for the providedexample files.

2.2. Software functionalities

In summary, the major functionalities of the software are (1)visual alignment of variants and read counts for evaluation ofVAF shifts and altered read counts for the detection of somaticchromosomal copy-affecting and copy-neutral events, (2) statisti-cal inference and (3) export of plots to file (JPEG). The individualplots of each chromosome include an estimated position of the

centromer relative to the flanking variants on the respective p-arm and q-arm. The positions have been retrieved from the UCSCTable Browser (http://genome.ucsc.edu, UCSCGenome Informat-ics Group, University of California, Santa Cruz, CA, USA) for theassemblies GRCh37/hg19 and GRCh38/hg38 (track: ChromosomeBand, table: cytoBand). This data is preloaded in CNAplot (currentversion 1.2).

3. Illustrative example

In this section, we will provide an example of usage basedon paired-end whole exome sequencing on the Illumina platform(San Diego, CA, USA) with a case of T-cell acute lymphoblasticleukemia (T-ALL) from a previously published report [11], and acontrol sample derived from a paired skin sample with approx-imately 90 ⇥ 106 reads each. T-ALL is a malignancy caused byneoplastic proliferation of immature precursors of T-cells. It isoften aggressive with a rapid progression.

The following provides a brief step-by-step description onhow to generate the necessary input files for CNAplot underLinux. Note, these steps are not required to run the softwarewith the provided examples files, nor is it the required workflow.Alternatively, user input files may be produced with bcftools [12]for variant calling and GATK 3.8 (Broad Institute, Cambridge,MA, USA) DepthOfCoverage command for obtaining read countsin specified regions (see supplied example input files).

148

4 M.H. Hansen, P. Hokland and C.G. Nyvold / SoftwareX 11 (2020) 100503

Fig. 2. CNAplot requires variants and sequencing read counts in intervals of suitable sizes. The variants must be in single sample variant call format (VCF) as itparses the included genotype information to variant allele frequencies. Read counts input is provided in browser extensible data format (BED), and may be generatedwith bedtools multicov or other similar tools. See example files for correct input format. The data is loaded into memory upon import through SQLite (version 3).

3.1. Installation of necessary tools to generate CNAplot input files

We recommend that alignment of short paired-end reads fromWES is performed with Burrows Wheeler aligner (BWA) [13].BWA can be installed on Linux by opening a terminal windowand typing sudo apt-get install bwaor as a conda package, condainstall -c bioconda bwa (Anaconda Inc, Austin, TX, USA). PicardTools (Broad Institute) and bedtools [14] are installed in thesame manner. GATK4 is downloaded directly from the homepage(gatk.broadinstitute.org) or installed as a conda package.

3.2. Alignment of sequences to the human reference genome, sortingand indexing

Open a terminal window and type the following command toperform alignment to your provided reference genome and targetregions:

bwa mem -M -R "@RG\tID: \tSM: \tPL: \tLB: \tPU: "\referencegenome.fasta $file1 $file2 | samtools view \-L regions.bed -Sb -> aligned.bam

The resulting alignments can be inspected in Integrative Ge-nomics Viewer (IGV) [15,16] after the sorting and indexing of theBAM file, e.g. with Picard or Samtools. Note that if your wholeexome sequencing library preparation is amplicon-based insteadof capture-based, hard filtering of PCR duplications with GATK4MarkDuplicates is not recommended as it will remove a largefraction of the total reads.

PicardCommandline SortSam INPUT=aligned.bam \OUTPUT=aligned.sorted.bam SORT_ORDER=coordinate

PicardCommandline BuildBamIndex INPUT=aligned.sorted.bam

3.3. Retrieval of sequencing counts

Obtaining read depths for provided intervals of your alignmentcan be performed using the multicov tool, which is a part of thebedtools utilities [14]. As example, interval regions (e.g. refGene)in the Browser Extensible Data format (BED) can be retrievedfrom the UCSC Table Browser at genome.ucsc.edu or the providedexample BED-file can be downloaded from the data repository(http://dx.doi.org/10.7910/DVN/BWY6SK, Harvard Dataverse) andused as intervals (GRCh37 assembly). The resulting output con-tains read counts for a specific gene with two files created foreach sequencing pair –e.g. case sample and control sample.

bedtools multicov -bams aligned.sorted.bam \-bed regions.bed > aligned.sorted.bed

3.4. Variant calling as input for CNAplot allele frequency retrieval

We recommend GATK4 for variant calling [17–19]. Each align-ment in the sample-control sequencing pair is processed throughthe HaplotypeCaller tool in the genome analysis toolkit. Althoughbase quality score is optional with the BaseRecalibrator and Apply-BQSR tools, it is highly recommended for improved assessmentof base qualities, e.g. for downstream filtering (See the current

149

M.H. Hansen, P. Hokland and C.G. Nyvold / SoftwareX 11 (2020) 100503 5

Fig. 3. This figure shows all variants aligned to the read count ratios of specified intervals in supplied BED-files. Here, each ratio represents a gene. As the number ofgenes roughly approximates the number of short coding variants, it is highly practical for the comparison of copy number alteration affected variant allele frequenciesand read counts. The example here reveals a trisomy 4 and a smaller significant allele frequency shift on chromosome 9. Significant copy of calls on 12, 14, 19 and22 may be false-positives at the specified significance level, ↵ = 0.0001(�2), and is expected at this threshold. The compound probability provided by the softwarecan be used to support copy changes, where two or more successive allele frequency alterations occur.

GATK4 documentation for more details). HaplotypeCaller can de-tect variants directly by providing the generated alignments asinput:

gatk HaplotypeCaller \-R referencegenome.fasta \-L regions.bed \-I alignment.sorted.bam \-O variants.vcf

The resulting files from the outlined steps, i.e. a paired set ofBED-files and a paired set of VCF, are all what is needed as inputin CNAplot (Fig. 2).

Note that genotype information has been removed from thesupplied examples VCF files. Loading of the provided sample datashows two highly significant alterations: (1) a chromosome 4trisomy and (2) a chromosome 9 copy-neutral deletion (Figs. 3and 4). Significant changes in variant allele frequency is backed byperforming a U-test on change in read depths of the focus samplecompared to that of the control sample (Fig. 5). The combinedassessment of variant allele frequencies and read depth ratios en-ables the researcher to locate copy-neutral loss of heterozygosity– a feature which may be highly relevant as a supplemental toolfor proper characterization of diagnosis and follow-up in a subsetof the cancer patients.

The latest addition to the CNAplot software (version 1.2) es-timates the boundary between the p- and q-arm of the selectedchromosomes, here exemplified by a case of relapsed mantle celllymphoma with both copy loss and gain on chromosome 8 (Figs. 6and 7, not included as example data on Harvard Dataverse).

3.5. Troubleshooting

We have observed very poor resolution for samples sequencedwith less than 30 million reads, and therefore our recommenda-tion is 50 million reads, or preferably more, from capture-basedwhole exome sequencing library prepared from high quality DNA.We have not tested sample amplicon-based library preparationkits. Also, note that the control sample must be sequenced inthe same sequencing run. This is, empirically, important for bothcapture and amplicon-based sequencing libraries. It may not benecessary for the read depth control to be derived from the sameindividual. However, using paired control sample variants fromthe same individual with a similar amount of reads is necessaryto generate optimal results

Variant files from whole genome sequencing can attain verylarge file sizes, so it may be necessary to restrict variants to,for example, the coding regions. Also, both WGS and WES maycontain a large number of false positive variants. Thus, filteringon the basis of quality is necessary in any downstream analysison next generation sequencing data. We refer to the includedexample as a guideline for the number of observations. Currently,CNAplot implements analysis of autosomes only.

Note that the first launch under OS X it may be necessary toCTRL-click on the CNAplot icon and selecting ‘‘Open’’ to acceptlaunching software from an unidentified developer. Alternatively,in some less restrictive user settings it is adequate to acceptthat the software is launched for the first time. In Windows 10it is required to dismiss the warning from Windows DefenderSmartscreen in order to run the software. CNAplot does not install

150

6 M.H. Hansen, P. Hokland and C.G. Nyvold / SoftwareX 11 (2020) 100503

Fig. 4. The user can zoom in on individual chromosomes to evaluate alterations in variant allele frequencies (VAF) or read depth ratios. Each significantly changedVAF is marked by an aligned blue dot. The compound probability is calculated from the longest stretch of significantly altered variants. Exemplified here is a trisomyof chromosome 4, with variant allele frequencies and read depth ratios in strong agreement. Both are significantly altered between case and control samples, using�2 (VAF) and Mann–Whitney U-test statistics (bins of ratios with 50 gene regions in each).

any files, but runs directly from the downloaded and unzippedexecutables.

Source code is provided at the data repository.(http://dx.doi.org/10.7910/DVN/BWY6SK, Harvard Dataverse).

4. Impact

Investigation of acquired copy number alterations in cancer re-search and clinical laboratory diagnostics, prognostics and follow-up using next generation sequencing is highly relevant as itmay support the cytogeneticist and clinical decision making. Oneprominent example is the deletion of 17q, which is recurrentlyobserved together with TP53 mutations in leukemias [20,21],lymphomas and other types of cancer. While whole exome se-quencing has now passed its first decade of existence [1], andhas especially gained popularity and routine usage in diagnosticlaboratories the last couple of years, the focus has mainly been onnucleotide variant detection. Detection of copy number alterationfrom WES data has also been an area of intense interest, butcompared to variant calling, which to a large extent has beendriven by the computational guidelines from the Broad Institutewith its Genome Analysis ToolKit (GATK), a general consensus onthe workflow of detecting chromosomal gains or losses has notbeen established. While GATK4 Best Practices Workflows on copynumber variation are being developed, either the focus is germline copy number calls, or it requires a very large Panel of Normalsof suggested 40 control samples or more from a consistent librarypreparation and sequencing setup. Most often, this is not withinreach for the majority of researchers.

While many different types of CNA calling software exist, notmany tools combine allele frequencies and read depth ratios,

nor do they support plotting of both together. Some tools onlyrequire a single sample, such as the sophisticated QDNAseq Rtool [22] for whole genome sequencing which relies on depthof coverage analyses alone, and thus does not detect acquiredcopy-neutral loss of heterozygosity. As sequencing costs continueto drop, the use of paired case-control samples, allele frequencyand read depth juxtaposition has several advantages: Foremost,it gives the ability to determine somatic aberrations. Secondly,it strengthens the evidence of a given chromosomal copy alter-ation and provides the opportunity to detect acquired uniparentaldisomy, which is not visible from read depth ratios. Finally, theinclusion of a normal control sample in the same sequencing runhelps to diminish the batch variation and noise inherent to wholeexome sequencing, and next generation sequencing in general,such as the change of coverage in GC-rich regions. Such variationis often reproduced in the paired sample, thereby attenuatinglarge fluctuations in read depth ratios and dispersion of allelefrequencies. As a matched control is most often used for thedetection of somatic mutations, it is frequently included in adiagnostic setup or for a specific research purpose.

5. Conclusions

In conclusion, the presented lightweight plotting and analysissoftware provides fast means to evaluate copy number alter-ations, juxtaposing both variant allele frequencies and sequencingread depth ratios from control-paired cancer patient samples.This feature enables the evaluation of copy neutral chromosomallosses, which is still neglected to some extent in cancer researchand clinical diagnostics. As very few tools with graphical userinterface exist, and none are in the presented form, this software

151

M.H. Hansen, P. Hokland and C.G. Nyvold / SoftwareX 11 (2020) 100503 7

Fig. 5. Observing a very low compound probability is a strong indicator of large chromosomal lesions. Thus, such values are unlikely in data of otherwise highquality data by chance if a reasonable significance threshold is selected, e.g. ↵ = 0.0001. Here, a copy-neutral loss of heterozygosity is evident from the short armof chromosome 9: Variant allele frequencies have changed significantly, while the copy number is unchanged.

Fig. 6. This illustrative example shows chromosome 8 from a patient with mantle cell lymphoma (not provided as example data). The plot reveals that a large partof the p-arm has been deleted in either of the parental copies. Also, multiple copies of chromosome 8q are detected. Because a large part of the chromosomescontains one or more copy number alterations, the baseline has been shifted 20%. This is not needed in most cases as the baseline is set by the median value ofthe read depth ratios. A significant change in variant allele frequency is observed throughout chromosome 8.

may be valuable to many researchers. We hope that CNAplot mayhelp to map the extent of copy-neutral events in various cancerforms and aid in the detection of copy number alterations, in

general, by its graphical user interface and transparency. The soft-ware is available free of charge and can be downloaded throughthe provided link.

152

8 M.H. Hansen, P. Hokland and C.G. Nyvold / SoftwareX 11 (2020) 100503

Fig. 7. A significant change in read depths of genes on chromosome 8 is observed for the mantle cell lymphoma sample. This conclusion is based on bins of 30genes in each statistical test, using the Mann–Whitney U-test. As of CNAplot version 1.2 the location on the chromosome for each bin is shown.

Funding

Expenses for this project was covered privately by the firstauthor (DK VAT/CVR: 28960638) with research support from theUniversity of Southern Denmark and Odense University Hospital(OUH).

Declaration of competing interest

The authors declare that they have no known competing finan-cial interests or personal relationships that could have appearedto influence the work reported in this paper.

Acknowledgments

The authors thank Dina Mohyeldeen, MD, for critical proof-reading of the final draft. The first and senior author wish tothank the research colleagues at Hematology–Pathology ResearchLaboratory, OUH.

Appendix A. Supplementary data

Supplementary material related to this article can be foundonline at https://doi.org/10.1016/j.softx.2020.100503.

References

[1] Hansen MC, Haferlach T, Nyvold CG. A decade with whole exome sequenc-ing in haematology. Br J Haematol 2020;188(3):367–82. http://dx.doi.org/10.1111/bjh.16249.

[2] Belkadi A, Bolze A, Itan Y, Cobat A, Vincent QB, Antipenko A, et al. Whole-genome sequencing is more powerful than whole-exome sequencing fordetecting exome variants. Proc Natl Acad Sci USA 2015;112(17):5473–8.http://dx.doi.org/10.1073/pnas.1418631112, Epub 2015 Mar 31.

[3] Lelieveld SH, Spielmann M, Mundlos S, Veltman JA, Gilissen C. Comparisonof exome and genome sequencing technologies for the complete Captureof protein-coding regions. Hum Mutat 2015;36(8):815–22. http://dx.doi.org/10.1002/humu.22813.

[4] Yao R, Zhang C, Yu T, Li N, Hu X, Wang X, et al. Evaluation of three read-depth based CNV detection tools using whole-exome sequencing data. MolCytogenet 2017;10:30. http://dx.doi.org/10.1186/s13039-017-0333-5.

[5] Zare F, Dow M, Monteleone N, Hosny A, Nabavi S. An evaluation ofcopy number variation detection tools for cancer using whole exomesequencing data. BMC Bioinformatics 2017;18(1):286. http://dx.doi.org/10.1186/s12859-017-1705-x.

[6] Nam JY, Kim NK, Kim SC, Joung JG, Xi R, Lee S, et al. Evaluation of somaticcopy number estimation tools for whole-exome sequencing data. BriefBioinform 2016;17(2):185–92. http://dx.doi.org/10.1093/bib/bbv055.

[7] Abyzov A, Urban AE, Snyder M, Gerstein M. Cnvnator: an approach todiscover, genotype, and characterize typical and atypical cnvs from familyand population genome sequencing. Genome Res 2011;21(6):974–84. http://dx.doi.org/10.1101/gr.114876.110.

[8] Mann HB, Whitney DR. On a test of whether one of two random variablesis stochastically larger than the other. Ann Math Statist 1947;18(1):50–60.http://dx.doi.org/10.1214/aoms/1177730491.

[9] Wilcoxon F. Individual Comparisons by Ranking Methods Frank WilcoxonBiometrics Bulletin, 1 (6) 1945 80-83.

153

M.H. Hansen, P. Hokland and C.G. Nyvold / SoftwareX 11 (2020) 100503 9

[10] Rey D, Neuhäuser M. Wilcoxon-signed-rank test. In: M. Lovric, editor. In-ternational encyclopedia of statistical science. Berlin, Heidelberg: Springer;2011.

[11] Hansen MC, Nederby L, Kjeldsen E, Petersen MA, Ommen HB, Hokland P.Case report: Exome sequencing identifies T-ALL with myeloid featuresas a IKZF1-struck early precursor t-cell malignancy. Leukemia Res Rep2017;9:1–4. http://dx.doi.org/10.1016/j.lrr.2017.11.002.

[12] Li H. A statistical framework for SNP calling, mutation discovery, as-sociation mapping and population genetical parameter estimation fromsequencing data. Bioinformatics 2011;27(21):2987–93. http://dx.doi.org/10.1093/bioinformatics/btr509.

[13] Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinform (Oxford, England) 2009;25(14):1754–60.http://dx.doi.org/10.1093/bioinformatics/btp324.

[14] Quinlan AR, Hall IM. Bedtools: a flexible suite of utilities for comparinggenomic features. Bioinform (Oxford, England) 2010;26(6):841–2. http://dx.doi.org/10.1093/bioinformatics/btq033.

[15] Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G,et al. Integrative genomics viewer. Nature Biotechnol 2011;29(1):24–6.http://dx.doi.org/10.1038/nbt.1754.

[16] Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative genomics viewer(IGV): high-performance genomics data visualization and exploration. BriefBioinform 2013;14(2):178–92. http://dx.doi.org/10.1093/bib/bbs017.

[17] McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A,et al. The genome analysis toolkit: a mapreduce framework for analyzingnext-generation DNA sequencing data. Genome Res 2010;20(9):1297–303.http://dx.doi.org/10.1101/gr.107524.110.

[18] DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. Aframework for variation discovery and genotyping using next-generationDNA sequencing data. Nat Genet 2011;43(5):491–8. http://dx.doi.org/10.1038/ng.806.

[19] Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, et al. From fastq data to high confidence variant calls:the genome analysis toolkit best practices pipeline. Curr Protoc Bioinform2013;43(1110):11. http://dx.doi.org/10.1002/0471250953.bi1110s43.

[20] Zenz T, Kröber A, Scherer K, Häbe S, Bühler A, Benner A, et al. Monoallelictp53 inactivation is associated with poor prognosis in chronic lympho-cytic leukemia: results from a detailed genetic characterization withlong-term follow-up. Blood 2008;112(8):3322–9. http://dx.doi.org/10.1182/blood-2008-04-154070.

[21] Zenz T, Eichhorst B, Busch R, Denzel T, Häbe S, Winkler D, et al. TP53mutation and survival in chronic lymphocytic leukemia. J Clin Oncol OffJ Amer Soc Clin Oncol 2010;28(29):4473–9. http://dx.doi.org/10.1200/JCO.2009.27.8762.

[22] Scheinin I, Sie D, Bengtsson H, van de Wiel MA, Olshen AB, vanThuijl HF, et al. DNA copy number analysis of fresh and formalin-fixedspecimens by shallow whole-genome sequencing with identification andexclusion of problematic regions in the genome assembly. Genome Res2014;24(12):2022–32. http://dx.doi.org/10.1101/gr.175141.114.

154

Paper IV

155

Leukemia Research Reports 15 (2021) 100255

Available online 1 June 20212213-0489/© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Distal chromosome 1q aberrations and initial response to ibrutinib in central nervous system relapsed mantle cell lymphoma

Marcus Høy Hansen a,*, Karen Juul-Jensen a, Oriane Cedile a, Stephanie Kavan b, Michael Boe Møller a, Jacob Haaber a,1, Charlotte Guldborg Nyvold a,1

a Haematology-Pathology Research Laboratory, Research Unit for Haematology and Research Unit for Pathology, University of Southern Denmark and Odense University Hospital, Odense, Denmark b Department of Clinical Genetics, Odense University Hospital, Denmark

A R T I C L E I N F O

Keywords: Mantle cell lymphoma Central nervous system relapse Copy number alteration Chromosome 1 aberration Ibrutinib response

A B S T R A C T

Relapse involving the central nervous system (CNS) is an infrequent event in the progression of mantle cell lymphoma (MCL) with an incidence of approximately four percent. We report four cases of MCL with CNS relapse. In three of the four patients a large chromosomal copy-number alteration (CNA) of 1q was demonstrated together with TP53 mutation/deletion. These patients experienced brief response to ibrutinib, whereas a fourth patient harboring mutated ATM demonstrated a long-term effect to ibrutinib and no CNA. Although it is unclear whether chromosome 1q CNA contribute to specific phenotypes these reports may be of value as such lesions are uncommon features of MCL.

1. Introduction

Infiltration of the central nervous system (CNS) is an infrequent event in the progression of mantle cell lymphoma (MCL) and is esti-mated to occur in 4% of patients with primary systemic disease. [2, 3]. Taking into account the relative rarity of this malignancy, which is generally reported to be less than 10% of the non-Hodgkin’s lym-phomas, it is difficult to shed light on specific molecular lesions, if any, leading to the progression of CNS involvement. The severity of CNS progression, with the survival of only a few months, emphasizes the importance of further investigations into the molecular biology.

In this report, we briefly describe the response to ibrutinib in cases of MCL with CNS involvement as well as a recurrent observation, not previously reported to this extent. While cytogenetic changes are frequent in MCL both at diagnosis and relapse, with translocation t (11;14) being a prominent hallmark of the disease, somatic copy number alterations of chromosome 1q are, to our knowledge, very rare. Yet, we successively report four whole-exome sequenced patients, of which three evidently display large aberrations of chromosome 1, two being deletions and one a copy gain. Two of the patients progressed with 1q copy alterations at CNS relapse detected in cerebrospinal fluid (CSF), while another carried the deletion at diagnosis (Table 1). Patients 1, 3,

and 4 were found to have acquired large 17p deletions by assessment of variant allele frequencies (VAF). These findings were accompanied by mutations in TP53 for patients 1 and 2 at diagnosis (Table 1 and 2).

2. Patient cases

All four patients had a consistent immunophenotype evaluated by flow cytometry (FC) from diagnosis to relapse, being positive for CD5, CD19, CD20, CD22, and either kappa or lambda expression, while showing CD10, CD23, and CD200 negativity (Table 1). Additional in-dividual markers were employed in FC and immunohistochemistry staining for histopathological confirmation of MCL, and exclusion of CLL or other lymphomas. In the same manner, CCND1 and PAX5 expression was consistently evaluated at diagnosis, together with SOX11, and at relapse. All patients received rituximab as part of the first-line treatment regimens in combination with bendamustine (patient 1) and R-CHOP (patient 2, 3, 4), alternating with rituximab and high-dose cytarabine followed by autologous stem cell transplantation (patient 2, 3). All pa-tients initially responded to ibrutinib treatment after CNS relapse, evaluated by the clinical response in three patients and by CNS cytor-eduction in one patient who did not have clinical symptoms. Time to response varied from 4 days to 4 weeks.

* Corresponding author. E-mail address: [email protected] (M.H. Hansen).

1 Contributed equally

Contents lists available at ScienceDirect

Leukemia Research Reports

journal homepage: www.elsevier.com/locate/lrr

https://doi.org/10.1016/j.lrr.2021.100255 Received 5 January 2021; Received in revised form 14 May 2021; Accepted 23 May 2021

156

Leukemia Research Reports 15 (2021) 100255

2

The level of details for each report is deliberately decreased, successively, due to space limitations and scope. The status of chromosome 1q aberrations and somatic mutations were unknown at the time of sampling, and the patient cases were selected and sequenced for this study solely on the basis of receiving ibrutinib following CNS relapse. Methods are found in the online supplement together with a complete list of somatic mutations.

Patient 1: Male, age 72 at MCL diagnosis ultimo 2015 (stage IV B, Ann Arbor classification, Table 1). Malignant B cells from the inguinal lymph node were assessed pleomorphic at diagnosis, staining positive for CCND1, SOX11, PAX5, and BCL2 by immunohistochemistry in addition to previously described surface markers, while being BCL6 negative. FC detected approximately 68% lymphoma cells in a lymph node and 30% in bone marrow (BM). Complete remission was attained following treatment with rituximab and bendamustine. Onset of nausea, visual disturbances, and aphasia occurred within a year of the diagnosis. Only cells with normal karyotype were found in BM at relapse, while CSF contained an excess of 300,000 B cells/ml with meningeal involvement demonstrated by cerebral magnetic resonance imaging. Treatment with ibrutinib and prednisolone was initiated resulting in reestablished vision, nonimpaired speech and movements after a week together with a marked decrease in the number of cells in spinal fluid to 85,000 cells/ml. Neurological symptoms reappeared after 4 weeks, and the patient pro-gressed ad mortem 8 weeks after CNS relapse. A high burden deletion (~80%) was found at relapse on chromosome 1q (approximately 95 Mb, Fig. 1, patient 1 panel D, and Fig. 2) by whole-exome sequencing (WES). UBR5 (8:103,269,922 G>A, nonsense mutation, GRCh37) and ERBB2 (17:37,865,708 A>G, intron-exon splice junction) were shared between diagnostic lymph node, BM, and relapse CSF (Table 2). Both a TP53 mutation and 17p deletion were found by sequencing of diagnostic sample.

Patient 2: Woman, age 61, diagnosed following inguinal biopsy (stage IV B). The biopsy was positive for SOX11. The patient showed approx-imately 80% lymphocytes in BM by microscopic examination of imprint slides and was positive for PAX5 and CCND1 by immunohistochemistry. R-CHOP based immuno-chemotherapy was initiated and high-dose therapy and autologous stem cell transplantation (HDT+ASCT) was completed six months after diagnosis. CNS relapse occurred 10 months after diagnosis. CNS involvement was proven in CSF shortly after partial remission in BM (~1% residual MCL cells) with a concentration of well over a thousand cells per milliliter with approximately half of the leu-cocytes being lymphoma cells evaluated by FC. Ibrutinib treatment was initiated as second-line treatment, followed by immediate cytoreduction from 1200 cells/ml to 100 cells/ml sustained ad mortem. The patient succumbed during invasive aspergillosis one and a half month after CNS relapse. The B cell lymphoma displayed double mutated TP53 (Table 2),

Table 1 Patient characteristics. LDH: lactate dehydrogenase, U/L: units per liter, NA: not available, HDT+ASCT: High-dose therapy and autologous stem cell trans-plantation, CSF: cerebrospinal fluid, CNS: central nervous system, WES: analyzed with whole-exome sequencing, ~: approximately, *: SOX11 positivity was inconclusive at diagnosis but positive in the bone marrow two years later and prior to the first treatment regime, #: detected at diagnosis, $: detected at relapse. The chromosome 1q copy number alterations were found in CSF at relapse for patients 1 and 3, and in diagnostic lymph node for patient. 2 (sequencing of CSF at relapse failed with an exhaust of material).

Patient 1 2 3 4

Age at diagnosis (year) 72 (2015)

61 (2015) 62 (2014)

72 (2012)

Stage IV B IV B IV A IV A LDH (U/L) Elevated

(228) Elevated (244)

Normal (123)

Normal (158)

MIPI score 7,0 7,9 5,9 6,8 Ki67 45 90 NA 10 HDT+ASCT – + + – Intracerebral lesion + – + – Intraspinal lesion – + + – Relapse CSF involvement + + + – Relapse CNS involvement + + + +Ibrutinib response

duration at CNS relapse <1 month

~1 month ~1 month

> 60 months

Status at last follow-up Dead Dead Dead Dead Ibrutinib toxicity – Inversive

aspergillus – Pleura

effusion CD5, CD19, CD20, CD22 + + + +CD10, CD23, CD200 – – – – SOX11 + + +* +CCND1 + + + +TP53 deletion (WES) +#$ – +$ +$

TP53 mutation (WES) +# +# – – 1q copy number

alteration (WES) +$ +# +$ –

Table 2 Detected somatic mutations. Mutations overlapping with the Cancer Gene Census list provided by the Catalogue of Somatic Mutations in Cancer (COSMIC, v92, Tier 1 and 2) [11] are shown. The mutations were detected with Mutect2 (flagged PASS), with sequencing alignment in BWA (GRCh37) and processing in GATK 4.1.8 (Broad Institute, Cambridge, MA, USA) [10]. All mutations with a depth of coverage below 20 were discarded. Pt.: Patient number. DP: Depth of coverage. NA: Not available. ND: Not detected. Relapse denotes relapse cerebrospinal fluid for patients 1 and 3, relapse bone marrow for patient 2 with almost no detectable malignant cells (1%, see Fig. 1. Note that sequencing of relapse CSF failed for pt. 2), in agreement with mutational status, and relapse dorsal tumor for patient 4.

Pt. Gene Position Identifier Diagnosis DP (Lymph node) Diagnosis (Bone marrow) Control DP Relapse DP

1 BCORL1 chrX:129,148,676 C>A ND ND ND 6/31 (19%) CARD11 chr7:2,985,521 A>T COSV62717189 50/193 (26%) 55/420 (13%) ND ND EBBR2 chr17:37,865,708 A>G (splice junction) 26/53 (49%) 10/70 (14%) ND 14/31 (45%) PBRM1 chr3:52,610,594 A>AC 33/93 (35%) 30/263 (11%) ND ND SMARCA4 chr19:11,123,693 G>A COSV60798909 ND 14/129 (11%) ND 14/46 (30%) TP53 chr17:7,577,515 T>G COSV52801994 47/81 (58%) 37/198 (19%) ND ND UBR5 chr8:103,269,922 G>A 30/77 (39%) 44/231 (19%) ND 43/73 (59%)

2 EGFR chr7:55,087,012 G>T 56/154 (56%) ND ND NA NUP98 chr11:3,793,053 C>T rs148092095 185/370 (50%) ND ND NA SF3B1 chr2:198,261,029 A>T 62/203 (31%) 5/193 (3%) ND NA TP53 chr17:7,577,541 T>A COSV52732730 144/331 (44%) 4/212 (2%) 1/164 (1%) NA TP53 chr17:7,578,546 AG>A 237/299 (79%) 5/323 (2%) ND NA

3 CCND1 chr11:69,457,900 G>A NA ND ND 88/109 (81%) CSMD3 chr8:113,694,865 G>A NA 22/153 (14%) ND 50/117 (43%) NSD2 chr4:1,962,801 G>A COSV56386422 NA 31/266 (12%) 1/567 (0%) 150/237 (63%) ROBO2 chr3:77,681,753 G>C NA 22/269 (8%) 1/349 (0%) 50/156 (32%) SLC45A3 chr1:205,632,142 C>A NA 8/29 (28%) ND ND

4 ATM chr11:108,216,614 AG…>A NA 27/81 (33%) ND 27/89 (30%) DDX5 chr17:62,500,098 TACAG>T rs782442161 NA 2/562 (0%) ND 102/134 (76%) IRS4 X:107,977,919 G>T rs199512071 NA 7/47 (15%) ND ND PTPRK chr6:128,505,823 T>C COSV100949489 NA ND ND 51/138 (37%)

M.H. Hansen et al.

157

Leukemia Research Reports 15 (2021) 100255

3

negative for 17p deletion. In addition, a distal 1q deletion of approxi-mately 100 Mb was detected utilizing VAF and read depth ratio analyses of tumor-control pair (Fig. 1, patient 2, panel A), with an estimated cell fraction of 25–30%.

Patient 3: Male, age 62, diagnosed MCL (stage IV A) with 32% lym-phoma cells evaluated by FC. The patient was treated with R-CHOP after 2 years of “watch and wait” approach. HDT+ASCT was completed six months after the start of therapy. PAX5 and CCND1 positivity was confirmed, while SOX11 was inconclusive at diagnosis and confirmed positive two years later at treatment initiation. TP53 expression was absent, and the karyotype was evaluated as being normal at diagnosis. CNS relapse occurred two years after diagnosis with 89% lymphoma cells in CSF, dominated by CD19, CD20, and kappa light chain positive cells. Several mutations were found to be shared between diagnosis and CNS relapse (Table 2). This patient also displayed a large allelic

imbalance (Fig. 1, patient 3, panel C) of 95–100 Mb at the same location as patients 1 and 2 (Fig. 2), however showing evidence of a chromosome copy-gain (Fig. 3, patient 3). The estimated burden from variant allele frequencies was 60%. Ibrutinib treatment was initiated shortly after CNS relapse with a reduction of CSF leukocytes from 170,000 to 39,000 cells/ ml within three weeks. The patient progressed four weeks after relapse was identified and died five weeks after.

Patient 4: Male, age 72, diagnosed with MCL (stage IV A) followed by induction treatment with R-CHOP. Relapse occurred 2 years after diagnosis, where PET/CT showed bilateral intraocular tumor infiltra-tion, and a dorsal tumor was detected with a high number of clonal B cells (93% by FC). Assessment of the infiltration of corpus vitreum by fundoscopy confirmed the CNS relapse (Fig. 4), while no malignant cells were present in the CSF. The patient responded immediately to ibrutinib treatment and attained complete remission, examined by PET/CT two

Fig. 1. Allelic imbalance of chromosome 1q was observed in three of the four mantle cell lymphoma cases: in two at relapse in the central nervous system (patient 1 and 3, cerebrospinal fluid (CSF)) and one at diag-nosis (patient 2). Each individual plot displays the allele frequencies of heterozygous variants (VAF) on chromosome 1 and significant devia-tion, or imbalance, from normal disomy. In all three cases, the aberration spanned almost 100 megabases distal to the centromere, and thus the majority of the q-arm. Sequencing of patient 2 CSF relapse material failed and thus could not be evaluated (*). All three evident allelic im-balances on the q-arm were significantly different from the p-arm (***, Wilcoxon rank- sum test:p ≤ 10−4, Kolmogorov-Smirnov test of equal distributions: p ≤ 10−4).

Fig. 2. The size of the allelic imbalance on chromosome 1q was found to be approximately 100 megabases distal to the centromere across the three patients with apparent copy-alterations (the gradient marks the approximate starting point of the lesion of the four patients, as it is not possible to conclude the exact position from the whole-exome sequencing resolution).

M.H. Hansen et al.

158

Leukemia Research Reports 15 (2021) 100255

4

months after onset. Despite recurrent pleura effusions, the patient remained in remission beyond 60 months. Measurable residual disease was below 0.04%, evaluated by FC, five years after diagnosis with an immunophenotype concordant with that of diagnosis. The most notable finding from WES was a double mutation in ATM, present at diagnosis and relapse, and 17p deletion (Table 2). No copy number alteration on chromosome 1 was detected.

3. Discussion

We have presented four cases of MCL, of which three were struck by an allelic imbalance/copy-alteration on chromosome 1q and displayed a very brief response to ibrutinib at CNS relapse. The treatment response was only sustained in one of the patients (pt. 4). This patient was devoid of malignant cells in CSF at CNS relapse and did not harbor a chromo-some 1q loss or gain in the investigated materials. The WES copy- number profiling was based on the detection of allelic imbalances by variant allele frequencies and relative read-depth ratios, as previously implemented in the molecular profiling of four other patients with MCL using WES [6] (see also [7] and [5]), with the exception that the coverage of chromosome 1p arm was used for internal confirmation of the 1q copy number alteration, and to circumvent sequencing batch effects. All read depth ratios were normalized to that of the respective paired control.

While the knowledge supplements the current reports on MCL found within the CNS, by the fact that chromosome 1q copy number alterations are largely undescribed, it remains unclear whether the lesions, comprising almost the entire chromosomal q-arm, contribute to any phenotypic attributes of CNS infiltrating MCL or whether these play any role in the treatment

resistance to ibrutinib. Chromosome 1p or 1q deletions and amplifications have been reported in other lymphomas [1, 4, 8] but extensive 1q copy-number alterations seem to be rare in MCL. A few cases of 1q de-letions have previously been demonstrated in two MCL cell lines and five patient samples with single nucleotide polymorphism genomic micro-arrays [9], which are capable of finding both copy altering and copy-neutral events by allelic imbalance, but not to definitely deduce the copy state on its own. Such 1q aberrations are thus candidates for further investigations. We hope that this report may help other re-searchers to progress towards a better understanding of this often-aggressive disease.

Data availability

Somatic mutation candidates are found in the online supplement (annotated Mutect 2 output, GATK 4.1.8.1)

Ethical approval

The project was approved by the National Ethical Committee in Denmark (Approval no. 1,605,184).

Author’s contribution

All authors contributed to this work.

Declaration of Competing Interest

The authors have nothing to disclose.

Fig. 3. Coverage of the chromosome 1 q-arm region afflicted by allelic imbalance relative to the unaffected p-arm. Chromosome 1q of patient (pt.) 1 and 2 were evident of a copy-loss (A, dark gray), while a copy-gain was indicated for patient 3 (dark gray). All read depth ratios were normalized to that of the respective paired control. Samples without statistically signifi-cant allelic imbalance on the q-arm (white) were in close agreement with the paired control samples. BM: bone marrow, LN: lymph node, CSF: cerebrospinal fluid, PF: pleura fluid, DT: Dorsal tumor. The chromosome 1q imbalance observed in patient 1, 2 and, 3 were supported by qPCR (TaqMan Copy Number Assay) tar-geting the ABL2 gene (chromosome 1q25.2) and the reference gene TERT (chromosome 5p15.33).

Fig. 4. Fundoscopy showing infiltration of malignant cells in corpus vitreum at relapse (left and right eye). Patient 4 experienced a CNS relapse without cerebrospinal clonal B cells. The patient was still alive after 60 months, and the malignant cells did not show any aberrations on chromosome 1 in contrast to the other cases.

M.H. Hansen et al.

159

Leukemia Research Reports 15 (2021) 100255

5

Acknowledgements

We thank professor Mads Thomassen, Department of Clinical Ge-netics, Odense University Hospital, Denmark, for the assistance in relation to Next-generation sequencing. Also, we are grateful for the support provided by the foundations Grosserer Brogaard og Hustrus Mindefond, Dagmar Marshalls Fond, A. J. Andersen og hustrus Fond, and Eva og Henry Frænkels Mindefond.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.lrr.2021.100255.

References

[1] T.F. Barth, H. Dohner, C.A. Werner, S. Stilgenbauer, M. Schlotter, M. Pawlita, P. Lichter, P. Moller, M. Bentz, Characteristic pattern of chromosomal gains and losses in primary large b-cell lymphomas of the gastrointestinal tract, Blood 91 (1998) 4321–4330.

[2] S. Bernard, L. Goldwirt, S. Amorim, P. Brice, J. Briere, E. de Kerviler, S. Mourah, H. Sauvageon, C. Thieblemont, Activity of ibrutinib in mantle cell lymphoma patients with central nervous system relapse, Blood 126 (2015) 1695–1698.

[3] C.Y. Cheah, A. George, E. Gine, A. Chiappella, H.C. Kluin-Nelemans, W. Jurczak, K. Krawczyk, H. Mocikova, P. Klener, D. Salek, J. Walewski, M. Szymczyk, L. Smolej, R.L. Auer, D.S. Ritchie, L. Arcaini, M.E. Williams, M. Dreyling, J. F. Seymour, European Mantle Cell Lymphoma, N, Central nervous system involvement in mantle cell lymphoma: clinical features, prognostic factors and outcomes from the european mantle cell lymphoma network, Ann. Oncol. 24 (2013) 2119–2123.

[4] J.L. Garcia, J.M. Hernandez, N.C. Gutierrez, T. Flores, D. Gonzalez, M.J. Calasanz, J.A. Martinez-Climent, M.A. Piris, C. Lopez-Capitan, M.B. Gonzalez, M.D. Odero, J. F. San Miguel, Abnormalities on 1q and 7q are associated with poor outcome in sporadic burkitt’s lymphoma. a cytogenetic and comparative genomic hybridization study, Leuk. 17 (2003) 2016–2024.

[5] M.C. Hansen, L. Nederby, E. Kjeldsen, M.A. Petersen, H.B. Ommen, P. Hokland, Case report: exome sequencing identifies t-all with myeloid features as a ikzf1- struck early precursor t-cell malignancy, Leuk. Res. Rep. 9 (2018) 1–4.

[6] M.H. Hansen, O. Cedile, M.K. Blum, S.V. Hansen, L.H. Ebbesen, H.H.N. Bentzen, M. Thomassen, T.A. Kruse, S. Kavan, E. Kjeldsen, T.K. Kristensen, J. Haaber, N. Abildgaard, C.G Nyvold, Molecular characterization of sorted malignant b cells from patients clinically identified with mantle cell lymphoma, Exp. Hematol. 84 (7–18) (2020) e12.

[7] M.H. Hansen, P. Hokland, C.G. Nyvold, CNAplot — software for visual inspection of chromosomal copy number alteration in cancer using juxtaposed sequencing read depth ratios and variant allele frequencies, Sw.X. 11 (2020).

[8] J.M. Hernandez, J.L. Garcia, N.C. Gutierrez, M. Mollejo, J.A. Martinez-Climent, T. Flores, M.B. Gonzalez, M.A. Piris, J.F. San Miguel, Novel genomic imbalances in b-cell splenic marginal zone lymphomas revealed by comparative genomic hybridization and cytogenetics, Am. J. Pathol. 158 (2001) 1843–1850.

[9] N. Kawamata, S. Ogawa, S. Gueller, S.H. Ross, T. Huynh, J. Chen, A. Chang, S. Nabavi-Nouis, N. Megrabian, R. Siebert, J.A. Martinez-Climent, H.P. Koeffler, Identified hidden genomic changes in mantle cell lymphoma using high-resolution single nucleotide polymorphism genomic array, Exp. Hematol. 37 (2009) 937–946.

[10] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, M.A. DePristo, The genome analysis toolkit: a mapreduce framework for analyzing next-generation dna sequencing data, Genome. Res. 20 (2010) 1297–1303.

[11] J.G. Tate, S. Bamford, H.C. Jubb, Z. Sondka, D.M. Beare, N. Bindal, H. Boutselakis, C.G. Cole, C. Creatore, E. Dawson, P. Fish, B. Harsha, C. Hathaway, S.C. Jupe, C. Y. Kok, K. Noble, L. Ponting, C.C. Ramshaw, C.E. Rye, H.E. Speedy, R. Stefancsik, S. L. Thompson, S. Wang, S. Ward, P.J. Campbell, S.A. Forbes, COSMIC: the catalogue of somatic mutations in cancer, Nucle. Acid. Res. 47 (2019) D941–D947.

M.H. Hansen et al.

160

Paper V

161

PERSPECTIVE

Perspective: sensitive detection of residual lymphoproliferative diseaseby NGS and clonal rearrangements—how low can you go?

Marcus H. Hansena,b, Oriane C!edilea,b, Thomas S. Larsenb, Niels Abildgaarda,b, andCharlotte G. Nyvolda,b

aHematology−Pathology Research Laboratory, Research Unit for Hematology and Research Unit for Pathology, University of Southern

Denmark and Odense University Hospital, Odense, Denmark; bDepartment of Hematology, Odense University Hospital, Odense, Denmark

(Received 2 February 2021; revised 22 March 2021; accepted 30 March 2021)

Malignant lymphoproliferative disorders collectively constitute a large fraction of thehematological cancers, ranging from indolent to highly aggressive neoplasms. Being adiagnostically important hallmark, clonal gene rearrangements of the immunoglobulinsenable the detection of residual disease in the clinical course of patients down to a min-ute fraction of malignant cells. The introduction of next-generation sequencing (NGS)has provided unprecedented assay specificity, with a sensitivity matching that of poly-merase chain reaction-based measurable residual disease (MRD) detection down to the10−6 level. Although reaching 10−6 to 10−7 is theoretically feasible, employing a suffi-cient amount of DNA and sequencing coverage is placed in the perspective of the prac-tical challenges when relying on clinical samples in contrast to controlled serialdilutions. As we discuss, the randomness of subsampling must be taken into account toaccommodate the sensitivity threshold—in terms of both the required number of cellsand sequencing coverage. As a substantial part of the reviewed studies do not state thedepth of coverage or even amount of DNA in some cases, we call for increased trans-parency to enable critical assessment of the MRD assays for clinical implementationand feasibility. © 2021 ISEH – Society for Hematology and Stem Cells. Published byElsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

The minimal/measurable residual disease (MRD) con-cept and its use in hematology have been reinforced inthe last couple of years as next-generation sequencing(NGS) has become more accessible to clinical laborato-ries after a continuous increase in sequencing capacityand lowered costs. The introduction of NGS has pro-vided improved resolution of the nucleotide composi-tion of the gene rearrangements in both B and T cells,making it highly effective for longitudinal surveillanceof a single but also multiple coexisting clones. Muchwork has already focused on superseding the high sen-sitivity provided by quantitative polymerase chain

reaction (qPCR) while increasing assay specificity andcircumventing the need for patient-specific PCR pri-mers/probes. Because clonal gene rearrangements pro-vide specific signatures of both B-cell receptor (BCR)immunoglobulins and T-cell receptor (TCR) chains,these target sequences are of immense importance inthe diagnostic workup, and sometimes prognostically,because they can serve as a target for MRD quantifica-tion in the setting of malignant lymphoproliferative dis-orders. A frequent notion on assay sensitivity relates tothe capability of detecting residual malignant cellsamong normal counterparts down to one malignant cellout of a million healthy cells and potentially beyond.

Constituting a key component of the adaptiveimmune system, the B lymphocytes are highly special-ized in antigen presentation, BCR antigen binding, and

Offprint requests to: Marcus Høy Hansen, Department of Hema-

tology, Odense University Hospital, J.B. Winsløws Vej 15, 3rd floor,

Odense 5000, Denmark; E-mail: [email protected]

0301-472X/© 2021 ISEH – Society for Hematology and Stem Cells. Published by Elsevier Inc. This is an open access article under the CC BY license

(http://creativecommons.org/licenses/by/4.0/)https://doi.org/10.1016/j.exphem.2021.03.005

Experimental Hematology 2021;98:14−24

162

mounting a humoral response by the secretion of anti-bodies. The flexibility and extensive diversity of theimmunoglobulins are generated primarily by gene rear-rangements of the light and heavy chain loci. The pro-found implication is that the genomes of differentiatinglymphocytes are not static in these regions, introducingirreversible genomic changes. It is this highly specificsignature of the immunoglobulins that is exploited indiagnostics and MRD laboratory assays (Figure 1).After the naive B cell encounters an antigen, it mayundergo additional somatic hypermutations. Althoughclonal proliferation also represents a normal counterac-tion inherent to the immune response, acute or chronicexcessive expansion can be the result of a malignant clonaltransformation, with a distinct clonotype constituting thelymphoproliferative disorder. The motivation for perform-ing NGS analyses of clonal gene rearrangements is both toidentify the presence of clonal lymphoproliferation by aspecific unique gene rearrangement at diagnosis and todetect a minute amount of residual disease of this clono-type during clinical remission or as an early sign of relapseat follow-up.

In this perspective, we critically approach the topicof sensitivity in the setting of MRD detection bysequencing of clonal immunoglobulin heavy chain(IGH) gene rearrangements (Figure 1) in lymphoid neo-plasms. We evaluate how sensitive measurable diseasedetection is being performed by ultrahigh sequencingcoverage, potential technical problems to consider, andbiological pitfalls to establish a more nuanced dis-course of the sensitivities. Foremost, we address thetheoretical amount of DNA and sequencing readsneeded to confidently detect MRD at a given sensitivitylevel and factors that may influence the resolution.

Although the topics are narrowed to IGH gene rear-rangements, the general discussion also pertains to lightchain and T-cell receptor gene rearrangements. Forgeneral reviews on the topic of MRD detection bymeans of clonal gene rearrangements, we instead referto previously published general overviews [1−6].

Detection of residual lymphoproliferative diseaseIt has long been recognized that MRD during remissionis of prognostic value, hence there is a strong incentivefor pushing the limit of detection further to detect evensmaller amounts of clonal cells than possible with cur-rent qPCR techniques. The topic has been exploredextensively throughout the last two decades, mostrecently by NGS owing to its scalability, specificity,sensitivity, and, equally important, the avoidance ofpatient-specific assays. The undetectability of MRDafter intervention in acute lymphoblastic leukemia(ALL) provides an improved measurement of favorableprognosis when evaluated by NGS in comparison toflow cytometry [7,8]. Although ALL, and lately multi-ple myeloma (MM), has received much of the atten-tion, the use is widening to other malignancies, such aschronic lymphocytic leukemia (CLL) [9]. Thompsonet al. [9] reported that the majority of patients foundMRD negative by flow cytometry had measurable dis-ease when evaluated with NGS down to the 10−6 level,but they also noted that the recent developments inflow cytometry may provide a resolution comparable tothat of NGS. Patients with undetectable MRD usingNGS were found to have superior progression-free sur-vival [9]. Moreover, MRD was evaluated to be a deter-mining factor in the risk of relapse after stem celltransplantation [10], using TCR and immunoglobulingene rearrangements as MRD targets using PCR inpatients with relapsed ALL. Patients with less than 1 in10,000 malignant cells had a higher probability ofevent-free survival and a lower cumulative incidenceof relapse. One interesting finding of these resultsrelated to a “dose-dependent” relation between thelevel of residual malignant cells and event-free survival(Figure 1).

Before the implementation of NGS, qPCR provideda leap in the detection of clonal gene rearrangement,normally reported within the range of 10−4 to 10−6

[11], compared with DNA fragment analyses. BeforeqPCR, Southern blots had been the gold standard. In athorough introduction to the topic of clonal generecombinations concerning the standardization of PCRprotocols (BIOMED-2), van Dongen et al. [12]assessed the sensitivity of Southern blots to be in therange 5%−10% but cumbersome. That article providedrecommendations for PCR that are still highly relevantin the age of NGS. Although both immunoglobulins(IGH, IGK, and IGL) and T-cell receptor genes

Figure 1. The complementarity-determining regions (CDR1−3) of

the immunoglobulin variable domain (VH or VL) provide the diver-

sity for antigen binding. The hypervariability directly stems fromthe DNA rearrangement of the variable (V), diversity (D), and join-

ing (J) genes in the heavy chain locus and the V and J genes in the

immunoglobulin light chain (not shown here), whereas the constant

regions of the immunoglobulin (CL or CH) do not provide specific-ity toward an antigen. To amplify parts of the rearranged locus, the

PCR primers (! or ) target conserved framework regions in the

leader sequence present in the beginning of VH (not shown) or in

the VH genes (FR1−3) and JH genes (FR4).

M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24 15

163

(TCRB, TCRD, and TCRG; TCRA was not included inPCR) were addressed in the study, one important factwas that not all expanded clonal gene rearrangementsarose from an underlying malignancy. This argument isof course central to the strategy of mapping diagnostic clo-notypes before reliable MRD measurements can beachieved, regardless of the assay type. Gene rearrange-ments are not directly lineage specific, as aptly exemplifiedby immature acute B- and T-cell leukemias [13−15], andnot all lymphocytic neoplasms necessarily have a detect-able rearrangement with the applied technologies. Inapproximately 80%−90% of the studies referred to here, agene rearrangement is identified [16−18]. As such, a cen-tral part of the NGS MRD assay is to identify the primaryclonal gene rearrangement, that is, the clonotype, at diag-nosis to implement in follow-up quantification. Failure todetect any diagnostic clone may arise from primer issues,biologically incomplete gene rearrangement [19], or possi-bly immature malignancies devoid of rearranged immuno-globulin.

Although the concept of MRD has existed and beenused extensively in hematology for several decades, theunderlying foundation, methods, and calculations havechanged and lack standardization when relying on NGSmodalities. Currently, it is assumed that NGS has a clinicalcapability of detection down to the 10−7 sensitivity level[6], although such achievements have not fully been dem-onstrated in practice. This level of sensitivity may eventu-ally be demonstrated empirically, without the use ofextrapolating standard curves from serial dilutions, whichis statistically feasible when larger cohorts are analyzed.However, its general applicability remains somewhat spec-ulative, especially because of the increased cost of reach-ing such a high sensitivity.

Recently, Yao et al. [20] reported concordancebetween allele-specific oligonucleotide qPCR (ASO-qPCR) and NGS, similar to that described in earlierstudies [21], in terms of MRD negativity and positivitydetection but also comparable dynamic ranges. Theauthors provided information on the sequencing depthand the rationale for obtaining 10−5 by employing trip-licates of 1 mg each [20] and commented critically onarticles pushing the sensitivity two orders of magnitudedown to the 10−7 level. Considering the notion that“sequencing assay sensitivity is limited only by thenumber of input cells and thus can detect residual dis-ease at levels well below 1 in 1 million leukocytes”[22], one may question how sequencing coverage isbeing taken into account. The number of reads was inthis particular article stated as 106, and thus, Poissonsampling may have a detrimental effect on the assay. Itcannot be assumed that the DNA of the cells will beconverted to sequencing reads in a one-to-one manner.However, if the sensitivity is based on a total numberof, for example, nucleated cells, then the identification

of a clonal sequence among background immunoglobu-lin gene rearrangements may suffice without reachingthe aforementioned coverage threshold.

A proper sensitivity is not just detecting one in amillion cellsOn one hand, it may important to ask why the sensitiv-ity cannot go lower than currently reported, and on theother hand, how such a threshold is feasible in practi-cal terms and how it is evaluated [6]. Although sensi-tivities of 10−6 are reported in the literature (Table 1),stable detection of one clonal cell in a million is,empirically, another issue. Neither does sporadic MRDnegativity at 10−6 correspond to an established assaysensitivity at this level, nor do serial dilution assays,for example, demonstrating linearity down one in amillion, provide applicable or robust assays for theclinical laboratory. Another issue relates to the biologi-cal aspects. As Thompson et al. [9] briefly note, MRDvalues below 10−6 do not necessarily indicate a cure[9]. It is impossible, logically, to establish a limit atwhich long-term event-free survival can be guaranteed.

Several criteria must be addressed before achievinghigh sensitivity. MRD assessment based on detectionof somatic single or short nucleotide variants is limitedby the error rate of the sequencing platform [36] and,thus, is potentially detrimental to the specificity of theassay. This poses less of a problem in the detection ofB- or T-cell gene rearrangements, which rely on a longsequence as the clonal fingerprint. However, it doesstill relate to the number of base mismatches to acceptwhen including reads in the cumulative clonal fre-quency. It is known that the effective error rate is notconstant across the length of the DNA sequence. Cur-rently, there is no consensus on when to exclude readsfrom the detected clonotype. As an example, the Lym-phoTrack assay allows up to two mismatches (Instruc-tions for Use, Nos. 280364 and 280473, Invivoscribe,San Diego, CA, USA).

Role of PCR, stochastic subsampling, and the impacton sensitivitiesThe process of drawing a liquid or tissue biopsy,through DNA extraction, PCR, and NGS to the finalcomputational output of clonal sequences comprisesdiscrete steps that contribute to the final resolution.Much effort has been exerted into understanding thekinetics of polymerase chain reactions, in which theamplification of DNA in most simple terms can bedescribed as an exponential growth rate, depending onthe starting amount, amplification efficiency, and num-ber of thermocycles (Figure 2A). One oversimplifica-tion of this model relates to the PCR amplification ofimmunoglobulin stretches in the diagnostic sample,which involves multiple primers targeting consensus

16 M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24

164

sequences in the leader region and FR1, FR2, FR3, andFR4 regions (Figure 1). Such multiplex reactions withmultiple templates may entail both different startingconcentrations and varying efficiencies. Thus, the PCRend product can be considered as a sum of both identi-cal, clonal sequences and nonidentical, backgroundsequences of amplified DNA as exemplified inFigure 2A (Eq. 1). Although the ratio of amplified IGHgene rearrangement from the malignant clone to thatfrom the background is expected to be constant beforeand after PCR amplification (Figure 2A, Eq. 2), onepotential consequence of analyzing minute amounts oftarget DNA from a patient in remission or molecularMRD is that the sample or replicate may contain zerocopies. Also, uneven amplification may occur when theclonotype amplification efficiency differs from that ofthe immunoglobulin background. With each subsequentstep that makes up an MRD assay expected to follow aPoisson distribution (Figure 2A, Eq. 3), the overall riskof failing to demonstrate MRD increases dramaticallywith a small amount of input and subsequent subsam-pling (Figure 2B). Basically, and most importantly, thePoisson distribution helps to predict the probability offailing to detect MRD given the known mean numberof targets. Because this is actually not known, it isfixed as the intended sensitivity level, and then thisprobability distribution is implemented to theoreticallypredict the amount of DNA needed. Undoubtedly, the

amplification efficiency and error rate will also affectthe sensitivity to some degree. In circumstances whenthe amplification efficiency is lower for the clonalsequence, the chain reaction may be rapidly skewedduring the PCR cycles (Figure 2A, Eq. 4).

One practical feature of the clonotype assays per-tains to the high specificity; because these assays relyon unique clonal sequences, they are very unlikely toproduce false-positive results, in contrast to MRDusing single-nucleotide variants, which is more fre-quently employed in myeloid disorders. Instead, thespecificity is comparable to that of FLT3 internal tan-dem duplications (ITDs) or other large indels—apartfrom the fact that the ITD can be difficult to sequenceand analyze because of its variable size [33,36,37]. AsThol et al. [24] emphasized, the “amount of DNA usedin the PCR amplification and the error rate of thesequencing system are additional factors that may influ-ence MRD sensitivity” [24]. Finally, false-positiveresults may arise from cross-contamination betweensamples [38−40], which is a known potential risk inNGS [41]. This issue is relevant in research studiesinvolving longitudinal assessment of patient samples.

Reaching adequate amounts of DNA and sequencingdepthsOne may expect that DNA from a million cells, crudelyequivalent to 6.5 mg, is sufficient to reach a sensitivity

Table 1. Reported sensitivities across selected studies

Study Disease Platform Amount of material/reads Estimated sensitivity Cohorta

[23] ALL PCR 1 mg, »6.5£ 105 cells 10−4−10−6 251 samples/168 patients[10] ALL PCR (BIOMED-1) — 10−3−10−4 91 patients

[22] B-ALL NGS (LymphoSIGHT),

ASO-qPCR, FC

6.0£ 105

(0.9−17£ 105) cells

10−6 (10−7) 106 patients

[24] AML NGS (FLT3 /NPM1) 7,758 and 15,278(393−24,997) reads

5£ 10−4 80 samples

[21] ALL/MCL/MM NGS, PCR (Multiplex) 1.5 mg (3£ 500 ng) 10−5 378 samples/55 patients

[25] ALL NGS (LymphoSIGHT) 10 mL PB/5 mL BM (MNCs) 10−4−10−6 237 samples/29 patients[26] MM NGS (LymphoSIGHT) <300,000 cells 10−3−10−5 133 patients

[27] B-ALL FC >750,000 cells 10−4 2,479 patients

[7] B-ALL NGS (ImmunoSeq IgH) 3 mg (sequences) 10−7 41 patients

[28] MM FC — 10−4 700 patients[29] MM FC »107 cells 10−5−10−6 385 samples/patients

[30] B-ALL NGS (BIOMED-2) 500 ng 10−4 (−10−5) 228 samples/30 patients

[31] MM NGS (LymphoSIGHT) 103 (59−288) mg/19 (6−36) mg

10−6−10−7 125 patients

[32] B-ALL NGS (LymphoTrack) 0.5−5 mg (105 reads) 10−6 122 samples/30 patients

[33] AML NGS (FLT3 -ITD) 0.7 mg, »100,000 cells 10−5 80 patients

[34] MM NGS (LymphoSIGHT) — 10−6 700 patients in overall study

[8] B-ALL NGS (ImmunoSEQ), FC 0.4−8 mg 10−6 619 patients[20] MM NGS (Lymphotrack) 3£ 1 mg 10−5 4 patients

[9] CLL NGS (ClonoSEQ) N/A (>1.9£ 106 cells?) 10−6 62 patients

[17] T-ALL NGS (capture-based panel) 0.6−1 mg 10−4−10−5 23 patients[35] MCL PCR (digital droplet) — 10−4−10−5 416 samples/166 patients

[16] ALL NGS (RNA) 0.4−1mg 10−4−10−5 258 patients

B-ALL=B-cell acute lymphoblastic leukemia; FC=flow cytometry; MCL=mantle cell lymphoma; T-ALL=T-cell acute lymphoblastic leukemia.aCohort is defined as estimated number samples and/or estimated number of patients included.

M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24 17

165

of 10−6. However, calculations reveal that this is notthe case, or as van Dongen et al. [2] explain, “the sen-sitivity of HTS [high-throughput sequencing] is depen-dent on the number of analyzed cells and thecorresponding amount of DNA” [2]. It is concludedthat a small amount, well below a million cells, forexample, 2−4 mg, will not suffice. Even with severalmillion cells obtained at the starting point, 50% of thetotal DNA may be lost during purification. A practical

option is to operate with a cell equivalent derived froma known quantity of purified DNA, which is then pipeddirectly into the PCR amplification step.

Currently, there are several commercial assays suchas LymphoTrack (Invivoscribe) and clonoSEQ (Adap-tive Biotechnologies, Seattle, WA, USA). As is exem-plified in the clonoSEQ Assay Technical Information(PNL-10027-02), a linear correlation between theempirical and expected levels of MRD can be obtained

Figure 2. The individual steps of the workflow of measurable disease detection affect the final sensitivity of the assay. In its most simple form,the PCR may be described as a power function, where the DNA molecule is doubled for each cycle (C): 2C. However, the PCR amplification of

gene rearrangements may be considered as a multiplex reaction, each with potentially different sequences and affinities between primers and tar-

get and, consequently, chain reaction efficiencies (E ≤ 1). The PCR product is thus a sum of the individual reactions (Eq. 1, Ni). In some cir-

cumstances, the efficiency of the clonotype sequence amplification may be lower than that of the background resulting in a skewed ratio (Eq. 2,RPCR 6¼ 1), which is augmented by the number of cycles (C). The biological sample drawn from the patient and the subsequent laboratory steps

(A) (n) may be interpreted as a stochastic process of subsampling, each following a Poisson distribution (Eq. 3). It simply describes the proba-

bility of observing the specified number of independent events, that is, rearrangements, when considering the amount of cells/DNA relative to

the sensitivity level (denoted by λ). Note that the probability of a false-negative observation is here given as 1 − P(k = 0). If the PCR efficien-cies are not comparable, that is, RPCR 6¼ 1, then this may heavily affect each step of the sampling and, hence, the final resolution (Eq. 4). The

simplified model shown here explains that at least three times (λ = 3 £) the number of cells/DNA, compared with the sensitivity level, must be

analyzed to attain confidence of approximately 90% when DNA sampling and sequencing are considered as 2£ individual steps (95% each) (B).The DNA may be distributed in a single reaction or in replicates. Here, it also suggests that for each successive subsampling in the workflow,

the probability (p) of detecting a clone roughly decreases to the power of individual steps if the amount of material or reads if kept constant.

Consequently, the amount of material, sequencing depth, and so forth must match the desired sensitivity.

18 M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24

166

at a wide dynamic range. However, a large variance orcoefficient of variation in the lowest detection range isalso observed, as expected from such assays. Multipleresearch studies have used serial dilutions to revealdynamic ranges of detection down to 10−6 or 10−7

[22,32,42]. As with other biomedical laboratory assays,the dynamic range found by serial dilutions in a con-trolled setup may not be entirely indicative of the per-formance in a clinical application. Although commonand generally accepted methods, such assays have limi-tations that may lead to false indications on the levelof sensitivity.

The random fluctuation observed in the lower-rangemeasurements from serial dilutions is partly attributedto variance, being inversely proportional to a decreas-ing number of DNA molecules or cells. As Yao et al.[20] warned, despite detecting residual disease at lowfrequencies (10−5), the results “were not reproducibleamong replicates” [20], consistent with the expectedvariation described by Poisson statistics. Another con-sequence of this stochastic nature is that if one wantsto determine that a certain assay can reach sensitivitiesdown to the 10−6−10−7 level, it crudely becomes amatter of running enough samples.

From an analytical stance, one problem is evident inthe current body of literature: establishing a sensitivitylevel set by the amount of input DNA. By providing103 and 19 mg of DNA to assess MRD from autograftand bone marrow (BM) cells [31], respectively, Taka-matsu et al. [31] concluded that NGS provides a one totwo orders of magnitude increase in sensitivity com-pared with ASO-qPCR (10−4−10−5). Because 0.6 mgof DNA was used for the ASO-qPCRs [31], corre-sponding to approximately 100,000 cells, the resultsare not directly comparable. Furthermore, sensitivity isnot only limited by the amount of DNA used, as wewill argue. Converting the reads to MRD is anothermatter. Yao et al. [20] provided useful guidelines inthe attempt to standardize the NGS MRD assay usingLymphoTrack and MiSeq, but they noted that there iscurrently no consensus on the implementation of spike-in controls in NGS [20]. In comparison, the clonoSEQassay additionally implements the total number ofnucleated cells for internal normalization.

Kotrova et al. [30] argued that although NGS iscommonly presented as the most sensitive modality,with the number of reads being the main adjustablefactor, it is primarily a matter of sample input amountand PCR efficiency [30]. Presumably, it is all three.When MRD quantification is performed based on NGS,it is important to aim at a sufficient number of reads toobtain the desired sensitivity. Aiming at a number ofreads lower than that required by the desired sensitivity[32] or not reporting the number of reads [31] poses aproblem when evaluating the assay. Reviewing NGS,

Takamatsu argued that the PCR products are sequencedat least 106 times using high-throughput NGS toachieve MRD detection of one in a million [43]. Con-trary to the reasoning in an earlier article on monitor-ing FLT3 -ITD and NPM1 mutations [24], 10,000sequencing reads do not equal a sensitivity of 10−4,nor do a million reads confidently reveal one in a mil-lion cells (see also [22,24,44]). One of the strategieshas been presented as “Typically, the number of readsexceeded the number of starting molecules, allowingevery starting molecule to be sampled” [22], whilebeing a statistical problem. As follows, a million readsare not sufficient in most cases to reach a 10−6 detec-tion level, nor to sample every starting molecule. Ifeach clonal gene rearrangement is successfully con-verted to sequencing reads in an assumed one-to-onemanner, theoretically, it follows from the Poisson dis-tribution that to confidently detect a single or moreclonal reads in 95% of the cases (Figure 2B), threetimes higher coverage of the region and approximately20 mg of DNA must be used. This amount is threetimes as much as may be expected if the same initialreasoning is used for DNA requirements. As Rustadand Boyle phrased it, calling the phenomenon the “ruleof three,” reaching 1 in 10 million is more than a tech-nical challenge [45] and requires a minimum of 30 mil-lion cells. One can easily envision an exhaust ofbiological material, while reaching a low 10−7 detec-tion level. As exemplified statistically, as crude as itmay be, this corresponds to just around 200 mg ofDNA, when estimating the amount of DNA in eachcell to be approximately 6.5 pg. A quarter more isneeded to lower the theoretical false-negative rate from5% to 2%, that is, corresponding to four times theeffective sensitivity level (Figure 2B). On the basis ofthese calculations, even the impressive assessment ofMRD in autografts using 103mg (59−288mg) of DNA[31] may fall short of the material needed to reach thedesired threshold with sufficient confidence. It is real-ized that the “rule of three,” derived from the Poissondistribution, may be insufficient in practice [45]because one would expect a loss of material, carryingDNA from several laboratory steps or if the conclusionrequires more than a single observation to confirmMRD clinically. This theoretical basis also justifies theuse of replicates when requirements are not met foreach individual analysis. It has previously beenassessed that replicates mitigate experimental errorsand increase the reliability and statistical power of theNGS assay [46,47] in comparison to the increaseddepth of coverage. Although several factors influencethis effect according to whether a technical or biologi-cal replicate is implemented, it is trivial to see that theeffective probability of an assay failure decreases withthe number of replicates. In its most simple form, this

M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24 19

167

can be described as the combined probability p(falsenegative)n, where n is the number of independent repli-cates. Thus, even relatively high error rates can bereduced effectively, such as widely used triplicates inqPCR.

Type of material used in MRD measurementsAnother pertinent question relates to the specific typeof material on which to base MRD measurements: areamplified immunoglobulin sequences based on bulk ormononuclear cells (MNCs) from bone marrow cellscomparable to those from purified B lymphocytes? Cur-rently, no studies consistently indicate that isolation ofB cells using a cell sorter, or a magnetic separationsystem, is required to reach a high level of sensitivityor that it outperforms analyses on MNCs, and mostcurrent studies using sequencing of clonal gene rear-rangements as an MRD tool do not rely on cell sorting.Also relevant is whether to base MRD measurementson peripheral blood, BM, circulating cell-free tumorDNA (ctDNA), and so forth. Selecting appropriately isemphasized by reports of differential levels of detec-tion, such as BM versus peripheral blood in CLL.Many lymphoma subtypes are aleukemic and oftenhave no bone marrow involvement. In this setting,ctDNA MRD detection is eagerly warranted, and pre-sumable sensitivity issues become even more challeng-ing. Recently, detection of circulating DNA frommultiple myeloma cells has gained interest because ofthe ease of sampling compared with bone marrowbiopsy. Although the detection has been backed by sev-eral reports [48−52], it has also been concluded thatthere is generally no correlation between circulatingtumor DNA and BM MRD levels [53]. In contrast, adirect correlation between clonal concentration in BMand peripheral MNCs, evaluated by sequencing, hasalso been reported [54].

From a clinical perspective, other problems mayarise. A patient brought into deep remission potentiallyexperiences severe therapeutic lymphocyte depletion.In such instances, the sorting of B cells may only pro-vide a small amount of DNA. Whereas bulk or MNCDNA from BM, peripheral blood, or other sources mayprovide enough material to reach the desired sensitiv-ity, it is methodologically different from sorted cells,and the results cannot be directly compared. Althoughcell sorting may potentially increase the sensitivity, itmay lead to a loss of material and present additionalsample requirements, such as in cryopreservation.Thus, choosing a strategy or assay suited for a specificpurpose is imperative. To provide an example, oneassay can be based on calculating the fraction of clonalsequences relative to total nucleated cells, as men-tioned previously; another may instead define MRD asclonal cells per million or rely on a spike-in control

equivalent to a fixed number of cells. Unarguably, thedifferent approaches will also perform differently underMRD circumstances in which the patient may bedepleted of B cells.

Another ongoing discussion relates to the suitabilityof formalin-fixed paraffin-embedded (FFPE) tissuespecimens. Even though thorough studies have foundthat FFPE of shorter amplicon length can be used diag-nostically to establish an MRD marker [42,55], it can-not currently be recommended for the application ofsensitive clinical MRD measurement because of therequirements regarding concentration, DNA integrity,and number of usable reads. As determined by Arcilaet al. [39], FFPE samples were generally of lower cov-erage and had a higher failure rate. Finally, as an alter-native to DNA, RNA is another choice of nucleic acid,which is not explored in depth here. One potentialadvantage of RNA is that the clonal sequence may behighly expressed, as explored recently in a cohort of258 diagnostic childhood B-cell ALL cases [16].Another study found that although the leukemic clonecould be identified in both DNA and RNA, the tran-scriptional clonal frequencies were generally lower[56]. Caution is warranted because not all rearrange-ments are transcribed, as reported by several studies[57−59]. As RNA sequencing has often been per-formed using shorter reads in its more general applica-tion, the robustness becomes apparent from the resultspresented by Blachly et al. [60], who reported strikingconcordance between Sanger sequencing and NGSusing 91 and 50 nucleotide paired-end reads. In thatstudy, no targeted enrichment was performed and onlya minor fraction of the reads mapped to the IGH regionof the genome.

Standardization of NGS MRDIt has become clear that standardization is important tothe future clinical applicability of MRD detectionemploying NGS. Such work is currently being per-formed by the EuroClonality−NGS consortium, which,in the recently published standardizations of IGH andTCR NGS, has presented protocols and guidelines forMRD marker identification and, not directly, the detec-tion of MRD [61]. However, it seems that the piecesare currently being put together at a rapid pace;Kotrova et al. [62] presented a protocol on behalf ofthe EuroClonality−NGS Working Group, which mayhelp to increase assay reproducibility and leverage theNGS MRD field. The consortium also provides soft-ware for the analysis and interpretation of largeamounts of immunogenetic data [63]. Although manyresearchers may implement commercial assays, such asLymphoTrack and clonoSEQ, there is a need for stan-dardized general guidelines for the implementation ofNGS MRD targeting clonal gene rearrangements.

20 M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24

168

Currently, this still leaves several issues regarding“how low you can go,” by practical means, largelyunsettled. Standardizing all aspects of such assays,including guidelines for reaching a stable sensitivity, isa huge undertaking still raising many questions, bothtechnically and biologically, and will expectedly con-tinue for years.

ConclusionsMuch work has been focused on designing and stan-dardizing primers, but the guidelines involving typeand sufficient amount of input material, adequate num-ber of on-target sequencing reads, sequence similaritythreshold for determining clonotypes, cross-contamina-tion, and other factors have not been addressed to thesame extent. A critical assessment of the practicallyfeasible sensitivity is often lacking or is approachedusing serial dilutions to determine a high sensitivity.Although agreeing that NGS provides high sensitivityand specificity, we find that there is a need for a morecritical and standardized evaluation of MRD detection.

We suggest that guidelines for MRD assessment usingNGS must stringently include the depth of coverage.The total number of sequencing reads attained is not areliable measure as reads may be unspecific targetsbecause of the competitive binding of primers in DNAsamples with a small amount of lymphocyte DNA.

Such guidelines, and reports of the results, must atleast include the amount of DNA used in the assay, butalso considered in subsequent steps such as PCR andlibrary preparation; the actual number of reads for eachsequenced sample and the effective depth of coveragefor clonal reads; transparent calculations of theachieved MRD sensitivity; and the use of mismatchthresholds or strategy when evaluating sequence devia-tions from the diagnostic clone. Other points raisedpertain to the most optimal biological sample type fora given disorder, potential enrichment strategies suchas MNC, and the possibility that such measures maypotentially be detrimental for the assay, such as retriev-ing a small number of cells by cell sorting. Most often,the cellular composition changes during treatment.

Table 2. Selected issues and concerns when detecting measurable disease by targeting immunoglobulins

Subject Consideration

Type of material used Is the biological sample suitable in terms of the type of leukemia or lymphoma?

Is the quality sufficient for the intended sensitivity? For example, is it derived from mononuclearcells, sorted cells, FFPE, etc.?

Biological differentiation If the investigated malignancy is very immature, there is a risk that the rearrangement has not taken

place.

Using RNA instead of DNA Potentially nontranscribed rearrangements, decreased stability, and a higher degree of mismatch. How-ever, the number of transcripts may also be substantial.

Amount of material Does the number of cells and DNA support the desired sensitivity level?

Implementing cell sorting Does sorting provide improved resolution, lead to loss of material or more complex workflow, or

change the interpretation of sensitivity level?Effects of subsampling Subsampling cannot be avoided because even drawing a sample or sequencing can be regarded as

downsampling or Poisson sampling. The probability of drawing the MRD clone may be described by

the Poisson distribution.Implementation of

assay replicates

Replicates may lower the effective assay failure rate and may influence the sensitivity, but may also

increase the running costs.

Cross-contamination Is there a risk of cross-contamination? Is this risk higher in research projects, where longitudinal sam-

ples may be included in the same PCR and sequencing chip?Sensitivity level What sensitivity level is needed for the assay? Does the amount of DNA and sequencing reads reflect

this?

PCR efficiencies Is the PCR amplification efficiency of the clonotype comparable to that of the background?

Sequencing platform Which sequencing platform should be chosen?What potential difference in error rate is expected?

Read depth and quality Does the number of reads comply with the intended sensitivity level?

How many reads are unmapped?Is the error rate constant between 50 and 30, and does it affect the assay sensitivity?

Specificity Do the sample quality, nucleotide error rate, algorithm mismatch threshold, amplicon, sequencing read

length, etc., support the intended specificity?

Clonotype mismatch What number of nucleotide mismatches is acceptable and what algorithm should be chosen to define aclonotype?

MRD calculations How is MRD defined and is it transparent? Does it depend on the total number of cells or a spike-in

control? Does it implement a probability distribution, potential skew in PCR efficiencies, etc.?

Background level At what read threshold does an identified rearrangement define a clonotype in comparison to the back-ground level?

Testing sensitivity by

serial dilutions

Does the serial dilution assay adequately describe the actual sensitivity of the assay?

M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24 21

169

Also, in research settings, when several longitudinallyacquired samples may be analyzed in the same batch,the risk of cross-contamination or sample carryovermust be evaluated.

Because the scope of this perspective is limited toraise awareness of some of the pitfalls when perform-ing detection of residual malignant lymphocytes withrearranged immunoglobulins by NGS, it is not possibleto provide an exhaustive review of the topics. In sum-mary, we have touched on some of the considerationsprovided in Table 2. Importantly, what the investigatormust realize is that from the point when a sample isdrawn from the patient, the sensitivity of the method isaffected in all subsequent laboratory steps toward thefinal processed sequencing output. In an oversimplifiedexample for the sake of clarity, one may visualize theextraction of approximately 6.5 mg of DNA, that is,1 million cells. The MRD level is 10−6 so if the proba-bility of drawing a sample of one or more malignantcells is theoretically 63%, then this may be reducedfurther to 40% when sequencing to a depth of coverageof 1 million (see Figure 2B). Sampling 3 million cellsand obtaining three times the depth theoretically aver-ages this figure to 90%; however, many other factors,such as quality, source of material, and PCR kinetics,influence the assay. Van Dongen et al. [2] and otherresearchers mentioned here have already touched onthe issue of lack of standardization, but discussion ofthe topic is still required several years after the intro-duction of the NGS/HTS MRD assays. We advocatethat transparent information, such as depth of coverageand the MRD calculation method, is mandatory whenreporting the sensitivity of the assay.

References1. Monter A, Nomdedeu JF. ClonoSEQ assay for the detection of

lymphoid malignancies. Expert Rev Mol Diagn. 2019;19:571–578.

2. van Dongen JJ, van der Velden VH, Bruggemann M, Orfao A.

Minimal residual disease diagnostics in acute lymphoblastic leu-kemia: need for sensitive, fast, and standardized technologies.

Blood. 2015;125:3996–4009.

3. Sanchez R, Ayala R, Martinez-Lopez J. Minimal residual disease

monitoring with next-generation sequencing methodologies inhematological malignancies. Int J Mol Sci. 2019;20:2832.

4. Kotrova M, Trka J, Kneba M, Bruggemann M. Is next-generation

sequencing the way to go for residual disease monitoring in acute

lymphoblastic leukemia? Mol Diagn Ther. 2017;21:481–492.5. Langerak AW, Bruggemann M, Davi F, et al. High-throughput

immunogenetics for clinical and research applications in immuno-

hematology: potential and challenges. J Immunol. 2017;198:3765–3774.

6. Kruse A, Abdel-Azim N, Kim HN, et al. Minimal residual dis-

ease detection in acute lymphoblastic leukemia. Int J Mol Sci.

2020;21:1054.7. Pulsipher MA, Carlson C, Langholz B, et al. IgH-V(D)J NGS-

MRD measurement pre- and early post-allotransplant defines very

low- and very high-risk ALL patients. Blood. 2015;125:3501–

3508.

8. Wood B, Wu D, Crossley B, et al. Measurable residual disease

detection by high-throughput sequencing improves risk stratifica-tion for pediatric B-ALL. Blood. 2018;131:1350–1359.

9. Thompson PA, Srivastava J, Peterson C, et al. Minimal residual

disease undetectable by next-generation sequencing predicts

improved outcome in CLL after chemoimmunotherapy. Blood.2019;134:1951–1959.

10. Bader P, Kreyenberg H, Henze GH, et al. Prognostic value of

minimal residual disease quantification before allogeneic stem-cell transplantation in relapsed childhood acute lymphoblastic

leukemia: the ALL-REZ BFM Study Group. J Clin Oncol. 2009;

27:377–384.

11. Leisch M, Jansko B, Zaborsky N, Greil R, Pleyer L. Next gener-ation sequencing in AML—on the way to becoming a new stan-

dard for treatment initiation and/or modulation? Cancers (Basel).

2019;11:252.

12. van Dongen JJ, Langerak AW, Bruggemann M, et al. Design andstandardization of PCR primers and protocols for detection of

clonal immunoglobulin and T-cell receptor gene recombinations

in suspect lymphoproliferations: report of the BIOMED-2 Con-certed Action BMH4-CT98-3936. Leukemia. 2003;17:2257–2317.

13. Beishuizen A, Verhoeven MA, van Wering ER, Hahlen K,

Hooijkaas H, van Dongen JJ. Analysis of Ig and T-cell receptor

genes in 40 childhood acute lymphoblastic leukemias at diagno-sis and subsequent relapse: implications for the detection of

minimal residual disease by polymerase chain reaction analysis.

Blood. 1994;83:2238–2247.

14. Meleshko AN, Belevtsev MV, Savitskaja TV, Potapnev MP. Theincidence of T-cell receptor gene rearrangements in childhood B-

lineage acute lymphoblastic leukemia is related to immunopheno-

type and fusion oncogene expression. Leuk Res. 2006;30:795–

800.15. Nosaka T, Kita K, Miwa H, et al. Cross-lineage gene rearrange-

ments in human leukemic B-precursor cells occur frequently with

V-DJ rearrangements of IgH genes. Blood. 1989;74:361–368.16. Li Z, Jiang N, Lim EH, et al. Identifying IGH disease clones for

MRD monitoring in childhood B-cell acute lymphoblastic leuke-

mia using RNA-Seq. Leukemia. 2020;34:2418–2429.

17. Cavagna R, Guinea Montalvo ML, Tosi M, et al. Capture-basednext-generation sequencing improves the identification of immu-

noglobulin/T-cell receptor clonal markers and gene mutations in

adult acute lymphoblastic leukemia patients lacking molecular

probes. Cancers (Basel). 2020;12:1505.18. Gawad C, Pepin F, Carlton VE, et al. Massive evolution of the

immunoglobulin heavy chain locus in children with B precursor

acute lymphoblastic leukemia. Blood. 2012;120:4407–4417.19. Gonzalez D, Gonzalez M, Alonso ME, et al. Incomplete DJH

rearrangements as a novel tumor target for minimal residual dis-

ease quantitation in multiple myeloma using real-time PCR.

Leukemia. 2003;17:1051–1057.20. Yao Q, Bai Y, Orfao A, Chim CS. Standardized minimal resid-

ual disease detection by next-generation sequencing in multiple

myeloma. Front Oncol. 2019;9:449.

21. Ladetto M, Bruggemann M, Monitillo L, et al. Next-generationsequencing and real-time quantitative PCR for minimal residual

disease detection in B-cell disorders. Leukemia. 2014;28:1299–

1307.22. Faham M, Zheng J, Moorhead M, et al. Deep-sequencing

approach for minimal residual disease detection in acute lym-

phoblastic leukemia. Blood. 2012;120:5173–5180.

23. van Dongen JJ, Seriu T, Panzer-Grumayer ER, et al. Prognosticvalue of minimal residual disease in acute lymphoblastic leukae-

mia in childhood. Lancet. 1998;352:1731–1738.

24. Thol F, Kolking B, Damm F, et al. Next-generation sequencing

for minimal residual disease monitoring in acute myeloid

22 M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24

170

leukemia patients with FLT3-ITD or NPM1 mutations. Genes

Chromosomes Cancer. 2012;51:689–695.25. Logan AC, Vashi N, Faham M, et al. Immunoglobulin and T cell

receptor gene high-throughput sequencing quantifies minimal

residual disease in acute lymphoblastic leukemia and predicts

post-transplantation relapse and survival. Biol Blood MarrowTransplant. 2014;20:1307–1313.

26. Martinez-Lopez J, Lahuerta JJ, Pepin F, et al. Prognostic value

of deep sequencing method for minimal residual disease detec-tion in multiple myeloma. Blood. 2014;123:3073–3079.

27. Borowitz MJ, Wood BL, Devidas M, et al. Prognostic signifi-

cance of minimal residual disease in high risk B-ALL: a report

from Children’s Oncology Group study AALL0232. Blood.2015;126:964–971.

28. Attal M, Lauwers-Cances V, Hulin C, et al. Lenalidomide, bor-

tezomib, and dexamethasone with transplantation for myeloma.

N Engl J Med. 2017;376:1311–1320.29. Flores-Montero J, Sanoja-Flores L, Paiva B, et al. Next genera-

tion flow for highly sensitive and standardized detection of mini-

mal residual disease in multiple myeloma. Leukemia.2017;31:2094–2103.

30. Kotrova M, van der Velden VHJ, van Dongen JJM, et al. Next-

generation sequencing indicates false-positive MRD results and

better predicts prognosis after SCT in patients with childhoodALL. Bone Marrow Transplant. 2017;52:962–968.

31. Takamatsu H, Takezako N, Zheng J, et al. Prognostic value of

sequencing-based minimal residual disease detection in patients

with multiple myeloma who underwent autologous stem-celltransplantation. Ann Oncol. 2017;28:2503–2510.

32. Cheng S, Inghirami G, Cheng S, Tam W. Simple deep sequenc-

ing-based post-remission MRD surveillance predicts clinical

relapse in B-ALL. J Hematol Oncol. 2018;11:105.33. Levis MJ, Perl AE, Altman JK, et al. A next-generation

sequencing-based assay for minimal residual disease assessment

in AML patients with FLT3-ITD mutations. Blood Adv. 2018;2:825–831.

34. Perrot A, Lauwers-Cances V, Corre J, et al. Minimal residual

disease negativity using deep sequencing is a major prognostic

factor in multiple myeloma. Blood. 2018;132:2456–2464.35. Drandi D, Alcantara M, Benmaad I, et al. Droplet Digital PCR

Quantification of mantle cell lymphoma follow-up samples from

four prospective trials of the European MCL Network. Hema-

sphere. 2020;4:e347.36. Levine RL, Valk PJM. Next-generation sequencing in the diag-

nosis and minimal residual disease assessment of acute myeloid

leukemia. Haematologica. 2019;104:868–871.37. Schranz K, Hubmann M, Harin E, et al. Clonal heterogeneity of

FLT3-ITD detected by high-throughput amplicon sequencing

correlates with adverse prognosis in acute myeloid leukemia.

Oncotarget. 2018;9:30128–30145.38. Rawstron AC, Fazi C, Agathangelidis A, et al. A complementary

role of multiparameter flow cytometry and high-throughput

sequencing for minimal residual disease detection in chronic

lymphocytic leukemia: an European Research Initiative on CLLstudy. Leukemia. 2016;30:929–936.

39. Arcila ME, Yu W, Syed M, et al. Establishment of immunoglob-

ulin heavy (IGH) chain clonality testing by next-generationsequencing for routine characterization of B-cell and plasma cell

neoplasms. J Mol Diagn. 2019;21:330–342.

40. Bartram J, Mountjoy E, Brooks T, et al. Accurate sample assign-

ment in a multiplexed, ultrasensitive, high-throughput sequencingassay for minimal residual disease. J Mol Diagn. 2016;18:494–

506.

41. Cibulskis K, McKenna A, Fennell T, Banks E, DePristo M, Getz G.

ContEst: estimating cross-contamination of human samples in next-generation sequencing data. Bioinformatics. 2011;27:2601–2602.

42. Scheijen B, Meijers RWJ, Rijntjes J, et al. Next-generation

sequencing of immunoglobulin gene rearrangements for clonality

assessment: a technical feasibility study by EuroClonality-NGS.Leukemia. 2019;33:2227–2240.

43. Takamatsu H. Clinical value of measurable residual disease test-

ing for multiple myeloma and implementation in Japan. Int JHematol. 2020;111:519–529.

44. Ayala R, Onecha E. Next generation sequencing as the new gold

standard for minimal residual disease detection in B-ALL. J Lab

Precision Med. 2018;3. 97−97.45. Rustad EH, Boyle EM. Monitoring minimal residual disease in

the bone marrow using next generation sequencing. Best Pract

Res Clin Haematol. 2020;33:101149.

46. Ziller MJ, Hansen KD, Meissner A, Aryee MJ. Coverage recom-mendations for methylation analysis by whole-genome bisulfite

sequencing. Nat Methods. 2015;12:230–232.

47. Robasky K, Lewis NE, Church GM. The role of replicates forerror mitigation in next-generation sequencing. Nat Rev Genet.

2014;15:56–62.

48. Kis O, Kaedbey R, Chow S, et al. Circulating tumour DNA

sequence analysis as an alternative to multiple myeloma bonemarrow aspirates. Nat Commun. 2017;8:15086.

49. Mishima Y, Paiva B, Shi J, et al. The mutational landscape of

circulating tumor cells in multiple myeloma. Cell Rep. 2017;

19:218–224.50. Manzoni M, Pompa A, Fabris S, et al. Limits and applications of

genomic analysis of circulating tumor DNA as a liquid biopsy in

asymptomatic forms of multiple myeloma. Hemasphere. 2020;4:

e402.51. Gerber B, Manzoni M, Spina V, et al. Circulating tumor DNA as

a liquid biopsy in plasma cell dyscrasias. Haematologica.

2018;103:e245–e248.52. Rustad EH, Coward E, Skytoen ER, et al. Monitoring multiple

myeloma by quantification of recurrent mutations in serum. Hae-

matologica. 2017;102:1266–1272.

53. Mazzotti C, Buisson L, Maheo S, et al. Myeloma MRD by deepsequencing from circulating tumor DNA does not correlate with

results obtained in the bone marrow. Blood Adv. 2018;2:2811–

2813.

54. Vij R, Mazumder A, Klinger M, et al. Deep sequencing revealsmyeloma cells in peripheral blood in majority of multiple mye-

loma patients. Clin Lymphoma Myeloma Leuk. 2014;14. 131

−139 e131.55. Langerak AW, Groenen PJ, Bruggemann M, et al. EuroClonal-

ity/BIOMED-2 guidelines for interpretation and reporting of Ig/

TCR clonality testing in suspected lymphoproliferations. Leuke-

mia. 2012;26:2159–2171.56. Wu J, Jia S, Wang C, et al. Minimal residual disease detection

and evolved IGH clones analysis in acute B lymphoblastic leu-

kemia using IGH deep sequencing. Front Immunol. 2016;7:403.

57. Tan KT, Ding LW, Sun QY, et al. Profiling the B/T cell receptorrepertoire of lymphocyte derived cell lines. BMC Cancer.

2018;18:940.

58. Plevova K, Francova HS, Burckova K, et al. Multiple productiveimmunoglobulin heavy chain gene rearrangements in chronic

lymphocytic leukemia are mostly derived from independent

clones. Haematologica. 2014;99:329–338.

59. Lombardo KA, Coffey DG, Morales AJ, et al. High-throughputsequencing reveals novel features of immunoglobulin gene rearrange-

ments in Burkitt lymphoma. Blood Adv. 2017;1:1261–1262.

M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24 23

171

60. Blachly JS, Ruppert AS, Zhao W, et al. Immunoglobulin tran-

script sequence and somatic hypermutation computation fromunselected RNA-seq reads in chronic lymphocytic leukemia.

Proc Natl Acad Sci USA. 2015;112:4322–4327.

61. Knecht H, Reigl T, Kotrova M, et al. Quality control and quantifi-

cation in IG/TR next-generation sequencing marker identification:protocols and bioinformatic functionalities by EuroClonality-NGS.

Leukemia. 2019;33:2254–2265.

62. Kotrova M, Darzentas N, Pott C, Bruggemann M, EuroClonality

NGSWG. Next-generation sequencing technology to identifyminimal residual disease in lymphoid malignancies. Methods

Mol Biol. 2021;2185:95–111.

63. Bystry V, Reigl T, Krejci A, et al. ARResT/Interrogate: an inter-

active immunoprofiler for IG/TR NGS data. Bioinformatics.2017;33:435–437.

24 M.H. Hansen,O. C!edile,T.S. Larsen,N. Abildgaard,C.G. Nyvold, / Experimental Hematology 2021;98:14 −24

172

Manuscript VI (preprint)

173

Workstation benchmark of Spark Capable Genome Analysis ToolKit 4 Variant Calling

Marcus H. Hansen1,2,3*, Anita T. Simonsen3, Hans B. Ommen3 and Charlotte G. Nyvold1,2

* Corresponding author

1 Haematolology-Pathology Research Laboratory, Research Unit for Haematology and Research

Unit for Pathology, University of Southern Denmark and Odense University Hospital, Odense,

Denmark. 2Department of Hematology, OUH, Denmark 3Department of Hematology, Aarhus University Hospital (AUH), Denmark

Contact information:

Marcus Høy Hansen (formerly Marcus C. Hansen), [email protected]

Office address: Odense University Hospital, Hematology-Pathology Research Laboratory, J. B.

Winsløws Vej 15, 5000 Odense C, Denmark.

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

174

Abstract

Background

Rapid and practical DNA-sequencing processing has become essential for modern biomedical

laboratories, especially in the field of cancer, pathology and genetics. While sequencing turn-over

time has been, and still is, a bottleneck in research and diagnostics, the field of bioinformatics is

moving at a rapid pace – both in terms of hardware and software development. Here, we

benchmarked the local performance of three of the most important Spark-enabled Genome analysis

toolkit 4 (GATK4) tools in a targeted sequencing workflow: Duplicate marking, base quality score

recalibration (BQSR) and variant calling on targeted DNA sequencing using a modest

hyperthreading 12-core single CPU and a high-speed PCI express solid-state drive.

Results

Compared to the previous GATK version the performance of Spark-enabled BQSR and

HaplotypeCaller is shifted towards a more efficient usage of the available cores on CPU and

outperforms the earlier GATK3.8 version with an order of magnitude reduction in processing time

to analysis ready variants, whereas MarkDuplicateSpark was found to be thrice as fast.

Furthermore, HaploTypeCallerSpark and BQSRPipelineSpark were significantly faster than the

equivalent GATK4 standard tools with a combined ~86% reduction in execution time, reaching a

median rate of ten million processed bases per second, and duplicate marking was reduced ~42%.

The called variants were found to be in close agreement between the Spark and non-Spark versions,

with an overall concordance of 98%. In this setup, the tools were also highly efficient when

compared execution on a small 72 virtual CPU/18-node Google Cloud cluster.

Conclusion

In conclusion, GATK4 offers practical parallelization possibilities for DNA sequence processing,

and the Spark-enabled tools optimize performance and utilization of local CPUs. Spark utilizing

GATK variant calling is several times faster than previous GATK3.8 multithreading with the same

multi-core, single CPU, configuration. The improved opportunities for parallel computations not

only hold implications for high-performance cluster, but also for modest laboratory or research

workstations for targeted sequencing analysis, such as exome, panel or amplicon sequencing.

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

175

Introduction The Genome Analysis Toolkit (1, 2) (GATK, Broad Institute, MA, USA) has been of tremendous

value to bioinformaticians and biomedical research laboratories involved in next generation

sequencing (NGS). The original GATK papers by McKenna et al. and DePristo et al. (1, 2) have

been elaborated on and referenced extensively, emphasizing its wide usage in research. Not only

did it provide a framework for consistent analysis of NGS data, but it also confronted the problem

of inconsistent assignment of base quality scores by the different sequencing platforms. More

importantly, GATK has also helped in the local realignment of reference genome mapped reads. It

was observed that without these precautions, one in five called variants could be assigned false

positives (2). From a clinical or a research point of view, the lack of such realignment is devastating

for the downstream interpretation of somatic mutations in cancer detected with MuTect (3) or other

somatic callers.

The preceding year has offered several advances to general bioinformatics and processing of

sequencing data: First, 2018 offered new modestly priced chips, such as the 12 to 18-core CPUs

launched in the fourth quarter by Intel and the 32-core by AMD. Secondly, solid state drives (SSD),

which have been a necessity for practical handling of large sequencing files (4), have experienced a

leap in read/write speed with the introduction of consumer disks utilizing PCIe 3.0 hardware

interface. On the software side, the release of GATK4 (software.broadinstitute.org/gatk) with its

fast Spark capabilities has coincided with several of these hardware innovations, while cloud

computing has also become more powerful, flexible and easier to use for parallelization. Thus,

multithreaded or parallel sequence processing on contemporary platforms mediates a potential leap

in the amount of processed data per unit of time, when exploited optimally.

We benchmarked the quite novel parallelization of GATK4 based on Apache Spark (The Apache

Software Foundation, MA, USA), which on one hand enables practical implementation in a high-

performance cluster, and on the other, it also enables efficient multithreading on single (or multiple)

CPUs, which is evaluated here. In theory, this eliminates the need for the scattering of the data into

chunks or intervals in order to utilize the full potential of the available cores on the machine. In

practice, this is still a reasonable approach to increase somatic calling speed with MuTect2 (3) in

version 3.8 and GATK4. Albeit more efficient in the latter version, Mutect2 still awaits to be

“Sparkified”. Behind the motivation and reasoning for local multithreading and parallelization lies

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

176

the different needs in research, development and clinically applied bioinformatics, where a cluster

may be impractical under some circumstances or may give rise to ethical concerns in clinical

laboratory diagnostics. Thus, new possibilities to optimize GATK workflows on single machines

are highly welcomed. For a recent benchmark of GATK4, not focusing on the Spark capabilities,

see Heldenbrand et al. 2018 (5).

Materials and methods

Data used in this benchmark was generated from 93 paired-end sequenced samples based on

mononuclear cell DNA derived from patients diagnosed with acute leukemia. Targeted panel and

amplicon sequencing were performed at Münchner Leukämielabor GmbH (Munich, DE) and

Department of Molecular Medicine (Aarhus University Hospital, Aarhus, DK), using the NovaSeq

and MiSeq platforms (Illumina, San Diego, CA, USA). The median read length was 140 and the

median number of sequencing reads were 28.9x106 (0.6x106–117.1x106). Paired-end whole exome

sequencing data was downloaded from the 1000Genomes project data (6) (HG00121, read length:

90, unaligned and aligned reads: 52x106). Compressed binary sequencing alignment files (BAM)

were prepared by alignment to reference genome (GRCh37) using BWA (7) sorted by queryname

or with subsequent coordinate sorting in Picard (Fig. 1).

GATK4 Spark (version 4.0.12.0/4.1.1.0) on a stand-alone workstation was evaluated against the

GATK4 non-Spark versions of MarkDuplicates, base quality score recalibration (BQSR) and

HaplotypeCaller (n=93) in addition to GATK3.8 and Picard MarkDuplicates (version 1.95, default

apt-get package) – the previous and thoroughly tested production tools used for targeted sequencing

analyses on long term support Ubuntu version 14.04 at the hospital departments. GATK3.8

processing was stopped after an adequate number of the samples had been processed for

comparison (n=59) due to exceedingly long execution times. In addition to panel sequencing, we

evaluated the command-line tools on single exome sequencing samples using the current minor

release (4.1.1.0). All the metrics were gathered locally on a clean installation of Ubuntu (14.04

LTS) using a single stand-alone mid-end workstation configured with a 12 core (24 logical cores)

CPU (i9-7920X, 12x2.9GHz/4.3GHz boost, 16 MB cache, Intel, Santa Clara, CA, USA), 512

Samsung (Seoul, South Korea) GB PCIe SSD (up to 3500 MB/s read/2500 MB/s write) and 32 GB

DDR4 quad-channel RAM. GATK4 commands were invoked through the supplied wrapper script,

running Java SE 8 (JRE build 1.8, Oracle). A memory swap file of 64 GB was created on the PCIe

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

177

SSD to tackle potential memory shortage. Disk performance and utilization during execution was

additionally assessed with the linux iostat command.

Cloud evaluation of the Spark-enabled tools (version 4.1.1.0) was performed on a small virtual

cluster with 18 nodes (Google Cloud (GC) Platform, Google LLC, Mountain View, CA, USA),

running Ubuntu 18.04 LTS and Apache Spark 2.4, with 72 virtual CPUs (vCPUs) persistent SSD

disks and 270 GB memory. Same GATK arguments as the local workflow described above were

provided along with additional cluster specific arguments (Fig. 1). The number of executors and

vCPUs, equivalent to the number of hyperthread cores, used for the comparison matched the

number of nodes and available cores, respectively. The specific number of nodes used here was

selected from performance testing of different node configurations with a fixed number of total

vCPUs (i.e. 18x4, 9x8 and 4x18) and GATK4 Spark processing speed, hence the implementation of

72 vCPUs. Performance was evaluated based on the public exome data set (HG00121 (6)). In

addition, downsampling (15–75%) of the largest aligned panel sequencing data file used in this

benchmark (5.3 GB BAM) to assess the idle initiation times of MarkDuplicateSpark,

BQSRPipelineSpark and HaplotypeCallerSpark on the cluster. Reference genome, list of targeted

regions and data files were loaded from GC data storage buckets upon execution. All exome

processing and cloud tests were performed in triplicates.

Results

In this benchmark report we evaluate the multithreading capabilities of three of the most important

tools in GATK4. Whereas the previous multithreading options in GAT3.8, not utilizing Spark, have

largely been dropped, partly owing to its poor scalability and stability, the current version

introduces sophisticated and unprecedented scaling possibilities, which is suited for contemporary

hardware. This also holds potential for large scale analyses using cloud computing and clusters.

The benchmark results showed a decrease in processing time with both the non-Spark and Spark

tools, and a linear relationship between the number of handled bases and the amount of time spent

per sample. GATK4 MarkDuplicates was twice as fast as the early Picard MarkDuplicates (version

1.95) with a superior correlation between execution time and the number of bases processed (Fig.

2A), whereas MarkDuplicatesSpark from coordinate sorted alignments showed an overall poor

correlation and decreased efficiency in this setup with roughly ten million bases processed per

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

178

second, estimated from the median rate. Subsequently, we benchmarked alignments sorted

according to queryname and the new documentation guidelines in version 4.1.1.0, which offered a

stable version of MarkDuplicatesSpark. This provided a consistent and significant increase in speed

(Fig. 2B, see Fig. 4A for exome data) by a factor of 1.7 (~3.6x107 b/s), compared to

MarkDuplicates (~2.1x107 b/s)), and could utilize BWA output directly. Output from high-depth

amplicon sequencing gave rise to a few outliers with lower speed.

Combined base quality score recalibration and variant calling for panel and amplicon sequencing

were seven to eight (6.7–8.0) times faster than the non-spark GATK4 counterpart (Fig. 3A, see Fig.

4A for exome data), based on the median number of bases per second (b/s). Both tools were

superior in terms of processed bases per unit of time, in comparison to GATK3.8, with nearly two

(1.7–1.9) and up to 15-fold (11.15–14.6) increase in efficiency with the GATK4 non-Spark and

Spark versions, respectively. The CPU usage in HaplotypeCallerSpark was close to full utilization

of all 24 logical cores in major parts of the procedure (Figure 3B, randomly drawn sample) with

less than 10 GB of RAM used at any time point in this setup. Likewise, BQSRPipelineSpark

showed hefty bouts of CPU activity at near maximum utilization in its runtime as well as low RAM

usage. The median processing rate of BQSRPipelineSpark and HaplotypeCallerSpark, combined,

was ten million bases per second, highly comparable to the tested larger exome sequencing data,

with each separate processing rate 2.1x107 and 2.0x107 b/s.

Disk utilization is a potential limiting factor in the overall processing speed. However, the PCIe

SSD read/write burden was found to be low during major parts of the GATK4 workflow in contrast

to CPU load, with the exception of BQSRPipelineSpark initiation (Fig. 3C, 1.9 GB BAM).

Variant concordance between GATK versions

Next, we compared the panel sequencing variant output from GATK4 HaplotypeCaller with the

respective variant call format (VCF) files generated by HaplotypeCallerSpark (Fig. 3D) and the 3.8

version of HaplotypeCaller. The median concordance, calculated as the fraction of GATK4 variants

identical to the observed variants for each set, was in the first case found to be 0.98 (Q1=0.90,

Q3=0.99) and 0.99 (0.96, 1) with or without indels, respectively, at a read depth threshold of 100

selected from a median depth 5801 (2629–6506). The median number of variants were 88 for

GATK4 HaplotypeCaller and 90 for HaplotypeCallersSpark. The variant concordance to

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

179

HaplotypeCaller 3.8 was found to be 0.94 (0.92, 0.97) and 0.96 (0.94, 0.98), with 89 variants called.

In each case, the number of called variants compared to standard GATK4 HaplotypeCaller was not

significantly different (pMann–Whitney > 0.05) and the concordance was found to be largely invariant to

decreasing read depth thresholds.

Implementing Spark workflow for exome sequencing and the Google Cloud Platform

Based on the results from the benchmark, we tested the Spark workflow on public whole exome

sequencing data (HG00121) – from unaligned reads (52 million reads of 90 bases each) to analysis

ready variants. The workflow combined BWA and the most current GATK4 MarkDuplicatesSpark,

BQSRPipelineSpark and HaplotypeCallerSpark. Processing was completed in 28 minutes

(measured in triplicates), of which BWA used 15 minutes. The combined GATK4 Spark rate

corresponded to 5.9x106 b/s, or approximately 4 million reads processed every minute, and a 78%

reduction compared to equivalent GATK4 non-Spark (Fig. 4A). Likewise, a 53% reduction relative

to the effective 72 vCPU cluster processing rate was observed. However, a second test using

downsampling (15-100% of 106x106 aligned reads, 5.3 GB) to test reproducibility showed that idle

script initiation, such as copying of the files from the databucket, consumed a large part of the

cluster wall-time ( Fig. 4B). Therefore, for a fair comparison to the local machine utilizing a faster

PCIe SSD, the benchmark baseline was adjusted accordingly. The equivalent speed was found

comparable to the stand-alone workstation. We also observed that random downsampling did not

alter the wall-time of HaplotypeCallerSpark, at least for the presented setup. This indicates that the

number of variants influence the execution time, rather than the depth of coverage. Furthermore, it

may cautiously indicate that a proper implementation of HaplotypeCallerSpark may provide

extremely fast variant calling of large alignment files, such as whole genomes or exome sequencing

with high depth of coverage.

Discussion

We observed that HaplotypeCallerSpark and BQSRPipelineSpark were significantly faster than the

equivalent GATK4 standard tools with a combined 85% reduction in execution time, and with a

median rate of one million processed bases per second. The called variants were found to be in

close agreement with the non-Spark version, with an overall concordance of 98%. The tools also

displayed a higher performance locally in comparison to the 72 virtual CPU/18-node cloud cluster.

GATK4 Spark implementation of quality score recalibration and variant calling was efficient and

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

180

matched, or even outperformed, the configured cloud cluster in terms of processing time. This may

only be true for smaller sequencing files of a few gigabytes or less. We argue that once the GATK4

Spark tools enter stable release, HaplotypeCallerSpark and BQSRPipelineSpark will be highly

suitable for targeted sequencing and even for fast processing of medium coverage exome

sequencing on modern CPUs and disks. With the most optimal usage of the GATK4 tools

subsequent to alignment, it is feasible to finish analyzing a whole exome or extensive panel

sequencing with high depth of coverage and paired-end reads in less than one hour on a

contemporary workstation, while following a workflow similar to current Best Practices with

regard to sequencing preprocessing and germline variant calling

(software.broadinstitute.org/gatk/best-practices). The results show that processing of approximately

24–48 typical whole exomes of 50–100 million paired-end sequencing reads can be performed in a

single day – and a larger amount of typical deep panel sequencing samples.

From the analyses we found both the Spark and non-Spark version of GATK4 MarkDuplicates to

be fast and highly suitable for production use, while the BQSRPipelineSpark and HTCSpark

provided efficient processing with minor deviation from non-spark variant calling. As GATK4

MarkDuplicatesSpark performance has been a subject of online debate, we emphasize that sorting

by queryname is extremely important before submitting the alignment to tagging or removal of

duplicate reads. Of note, MarkDuplicatesSpark is suited for processing of BWA output directly,

which makes the processing speed achieved here even more remarkable, and thus not directly

comparable to the non-Spark version.

As of early 2019, most of the GATK Spark tools are still offered in a beta version, and some tools,

such as MuTect2, do not yet have inherent capabilities for parallelization. With the recent release of

version 4.1, MarkDuplicatesSpark has moved beyond its beta version. Currently, pipeline tools

other than the BQSRPipelineSpark have been released, such as ReadsPipelineSpark which accepts

unaligned BAM and outputs analysis ready variants (VCF). From our evaluation the usage of this

pipeline tool is confined to experimental testing as we have experienced several crashes just before

the generation of the VCF file. We have also experienced close to physical RAM depletion at

runtime. Thus, we recommend using more than 32 GB of RAM, as some Spark pipeline procedures

may be memory-intensive.

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

181

In perspective, the current possibilities of installing high-end workstations with 32–36 physical

cores, e.g. by using two Intel Xeon, X-series processors or even a single 32-core CPU with AMD

RYZEN ThreadRipper (Santa Clara, CA, USA, launched in 2018), may render the need for cluster

access superfluous in many research or clinical laboratories. Such impetus and innovations, as

described, offer a viable and economical alternative to large high-performance clusters or to

Illumina’s 2018 soft- and hardware acquisition, DRAGEN from Edico Genome, which boasts to

process an entire genome (30x coverage) in just 25 minutes (Illumina). It had already proven its

worth in 2015, when setting the world record in clinical whole genome analysis (8), and again in

2018. While exciting, the analysis platform may be overshooting the target for the majority of

researchers and laboratories involved in next generation sequencing analysis – in terms of

production capacity and price.

Conclusions

In conclusion, the Genome Analysis ToolKit 4 offers unprecedented possibilities for parallelization,

as shown here by the high efficacy and performance on local cores. It was shown to be several

times faster than previous GATK3.8 multithreading with the same multi-core, single CPU,

configuration. The complexity of the toolkit – with regard to variant calling – is lowered compared

to earlier variant calling with GATK3 Unified Genotyper, as indel realignment is omitted from the

latest major version. In addition, GATK4 offers new tools which combines several earlier

commands as pipeline versions, although still not as stable releases. It comes with “boxed” Spark

capabilities per se without the need for clusterization in order to directly benefit from the innovation

brought by the Broad Institute without Apache Spark installation.

Key words: GATK4, Genome Analysis Toolkit, variant calling, next generation sequencing, Spark,

Multithreading and parallel computing

Declarations

Ethics approval and consent to participate

Ethics approval is not required according to National Committee on Health Research Ethics in

Denmark (http://en.nvk.dk/how-to-notify/what-to-notify) in the current form of methods testing and

quality control without any identifiable information.

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

182

Consent for publication

Not applicable.

Availability of data and material

The panel and amplicon sequencing data used in this study is not publicly shared as it is based on

routine clinical hospital tests of leukemia patients with no other purpose in this benchmark than to

gather performance metrics. All rights reserved to chief physician, phd, Hans B. Ommen (co-

author) and Aarhus University Hospital, Central Denmark Region. Data access can be granted on

request if relevant and in accordance to national ethical guidelines. Contact corresponding author

for more information. Exome data can be downloaded from The International Genome Sample

Resource (www.internationalgenome.org, ID: HG00121)

Competing interests

MCH is a part time independent consulting biomedical engineer and part time research fellow at

Odense University Hospital and Aarhus University Hospital. The other authors have no disclosures.

None of the authors are affiliated with the Broad Institute, Cambridge, MA, USA, or have a role in

the development of GATK.

Funding

Sequencing expenses was covered by AUH. All other parts of this study have been funded privately

by MCH and by the Faculty of Health Sciences, University of Southern Denmark.

Authors' contributions

ATS and HBO provided clinical sequencing samples for the analysis, MCH performed all analyses

described in the paper and wrote the draft. CGN conducted scientific supervision and assisted in

finalizing the manuscript. All authors contributed to the manuscript.

Acknowledgements

The authors would like to thank Dina Mohyeldeen, MD, for a critical view on the final draft. We

also wish to thank professor Peter Hokland, Aarhus University Hospital, for continuous scientific

supervision and support.

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

183

References

Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297-303. 2. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491-8. 3. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013;31(3):213-9. 4. Lee S, Min H, Yoon S. Will solid-state drives accelerate your bioinformatics? In-depth profiling, performance analysis and beyond. Brief Bioinform. 2016;17(4):713-27. 5. Heldenbrand JR, Baheti S, Bockol MA, Drucker TM, Hart SN, Hudson ME, et al. Performance benchmarking of GATK3.8 and GATK4. bioRxiv. 2018. 6. Genomes Project C, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68-74. 7. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754-60. 8. Miller NA, Farrow EG, Gibson M, Willig LK, Twist G, Yoo B, et al. A 26-hour system of highly sensitive whole genome sequencing for emergency management of genetic diseases. Genome Med. 2015;7:100.

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

184

Figure legends

Figure 1) Benchmark workflow and commands. Coordinate sorted binary alignment files (BAM)

were prepared from targeted paired-end sequencing and used in the benchmark test of variant

calling workflow using GATK4 Spark, GATK4 and GATK3.8 with Picard MarkDuplicates 1.95.

The local single CPU tests were conducted on a 12-core (24 logical cores) workstation with 32 GB

RAM. In addition, a cloud cluster was configured with 72 virtual CPUs (18 nodes) to assess

processing of whole exome sequencing data and the largest panel sequencing sample together with

local workstation.

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

185

Figure 2) Performance of MarkDuplicates (MD). A doubling in processing speed was observed,

when comparing GATK4 MarkDuplicates with Picard MD. Both tools displayed a linear correlation

between elapsed time and the number of bases processed (A), while the Spark version did not when

using coordinate sorted alignments. The calculated rates were approximately 2.1x107 bases per

second (b/s) for GATK4 MD and 1.1x107 b/s for Picard. The median rate for MarkDuplicatesSpark

(MDSpark) on groupname sorted alignment was estimated to be 3.6x107 b/s or 1.7 times faster than

the single-threaded GATK4 MarkDuplicates (B). The few outliers arise from amplicon sequencing.

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

186

Figure 3) Benchmark of Spark variant calling and efficiency. GATK4 Spark quality score

calibration and variant calling combined (BQSRPipelineSpark and HaplotypeCallerSpark) was

found to be several times faster than the non-Spark and GATK3.8 counterparts (A), with both tools

efficiently utilizing multithreading and with low memory consumption. The CPU and memory

usage of HaplotypeCallerSpark based on a random targeted sequencing sample is shown (B).

Read/write utilization (IO) of the PCIe solid state drive (up to 3500 MB/s read/2500 MB/s write)

was found to be low (shown in dark gray) during execution of the GATK4 MarkDuplicateSpark

(MD), base quality score recalibration (BQSRPipelineSpark) and HaplotypeCallerSpark (HTC),

with the exception of a short 20 seconds burst during initiation of BQSR reaching approximately

100% disk utilization (C, median replicate values). The concordance of variant calling between

Spark and non-Spark versions was found to be high (D), with a similar number of variants detected

(nmedian) and variant reads depths (DPmedian).

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

187

Figure 4) MarkDuplicatesSpark, BQSRPipelineSpark and HaploTypeCallerSpark were assessed

using public exome data (HG00121), locally with a 12-core CPU and externally using a 72 virtual

CPU/18-node cluster configuration (Google Cloud platform). Each analysis was performed in

triplicates with high reproducibility. The local Spark execution time was approximately twice as

fast as the cluster and four times as fast as the non-Spark GATK4 tools. We found that initiation of

the cluster processing consumed a large part of the cluster wall-times (*) for MarkDuplicatesSpark

and BQSRPipelineSpark. HaplotypeCallerSpark wall-time did not change with increasing number

of reads from a 5.3 GB sequencing file.

.CC-BY-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted May 19, 2020. ; https://doi.org/10.1101/2020.05.17.101105doi: bioRxiv preprint

188

189

University of Southern Denmark

UNIVERSITY OF SOUTHERN DENMARK CAMPUSVEJ 55 DK-5230 ODENSE PHONE: +45 6550 1000 [email protected] WWW.SDU.DK