Comparative performance of the REGA subtyping tool version 2 versus version 1



Infection, Genetics and Evolution 10 (2010) 380–385


Ana B. Abecasis a,b,1, Yunpeng Wang a,1, Pieter Libin a, Stijn Imbrechts a, Tulio de Oliveira c, Ricardo J. Camacho b,d, Anne-Mieke Vandamme a,*

a Laboratory for Clinical and Epidemiological Virology, REGA Institute for Medical Research, Katholieke Universiteit Leuven, Minderbroedersstraat 10, Leuven, Belgium

b Centro de Malária e Outras Doenças Tropicais, Instituto de Higiene e Medicina Tropical, Portugal

c Africa Centre for Health and Population Studies, Nelson R Mandela School of Medicine, University of KwaZulu-Natal, South Africa

d Laboratory of Virology, Hospital de Egas Moniz, Centro Hospitalar de Lisboa Ocidental, Portugal

ARTICLE INFO

Article history:

Received 22 June 2009

Received in revised form 29 September 2009

Accepted 30 September 2009

Available online 12 October 2009

Keywords:

HIV-1

Subtyping

ABSTRACT

The REGA HIV-1 subtyping tool is a phylogenetic-based method for subtyping HIV-1 genomic sequences that was published in 2005. The subtyping tool combines phylogenetic approaches with recombination detection methods. Recently, version 2 was released (http://www.bioafrica.net/rega-genotype/html/index.html) as an improvement of version 1. Version 2 implements a Decision-Tree-based algorithm that was not implemented in version 1. We wanted to compare the two versions on a large sequence dataset to assess the improvements of version 2 and to verify whether features lost during updating the tool needed to be recovered. We analysed the results of the two versions in the genotyping of 4676 HIV-1 pol sequences. We compared those results to a manual approach, used in previous studies. Our results show that version 2 has an overall better sensitivity but especially for the detection of subtypes A, B, D, F, G and CRF14_BG and CRF06_CPX. For the other subtypes, no significant differences were observed in the sensitivity of versions 1 and 2. The overall increase in sensitivity was however accompanied by a decrease in the specificity for the detection of subtype B. This is the main limitation of version 2. However, while the number of false negatives decreased by 53 samples, the number of false positives increased only by 5 samples from version 1 to 2. The performance of the REGA HIV-1 subtyping tool was considerably improved from one version to the other. Our results are very valuable and allow us to make suggestions for further improvement of the tool for a version 3 release.

© 2009 Elsevier B.V. All rights reserved.

Contents lists available at ScienceDirect

Infection, Genetics and Evolution

journal homepage: www.elsevier.com/locate/meegid

1. Introduction

Human immunodeficiency virus (HIV) is a retrovirus that causes the acquired immunodeficiency syndrome (AIDS). Due to high rates of mutation and replication, the fast accumulation of proviral variants during the course of infection and the high rate of recombination, HIV exhibits an extraordinary genetic diversity.

Historically, HIV-1 has been classified into three major groups: M (major), O (outlier) and N (non-M, non-O). Group M is responsible for the global pandemic, and based on phylogenetic analysis it was further divided into 9 subtypes, A–D, F, G, H, J and a large number of inter-subtype recombinants. Recombinant viruses that are identified in at least three epidemiologically unrelated individuals and characterized by full genome sequencing are designated as circulating recombinant forms (CRFs) (Robertson et al., 2000). Up until now, 43 CRFs have been identified (Los Alamos database). The remaining recombinant forms, which are found in isolated or small groups of epidemiologically related individuals, are called unique recombinant forms (URFs).

* Corresponding author. E-mail address: [email protected] (A.-M. Vandamme).

1 These authors contributed equally to this work.

1567-1348/$ – see front matter © 2009 Elsevier B.V. All rights reserved.

doi:10.1016/j.meegid.2009.09.020

Different subtypes and CRFs have distinct global distribution patterns (Osmanov et al., 2002). On a global scale, the most prevalent HIV-1 genetic forms are subtypes A, B, C, D and G, accounting for 12%, 10%, 50%, 3% and 6%, respectively, and CRF01_AE, CRF02_AG, each accounting for 5%, of all HIV-1 infections worldwide. In particular, subtype B is responsible for 67% of the overall infections in newly diagnosed patients in Western Europe (Abecasis et al., 2008), while subtype A accounts for 79% in Eastern Europe (Buonaguro et al., 2007; Hemelaar et al., 2006).

The high level of genetic variability of HIV-1 may have important implications for HIV pathogenesis, transmission, diagnosis, treatment and vaccine development. It is plausible that different subtypes have different biological properties resulting in differences in transmissibility and pathogenicity, but this issue is still a matter of debate (Hemelaar et al., 2006). A few studies have shown that subtype D may lead to more rapid disease progression than other subtypes (Baeten et al., 2007). It was also reported that some subtype G samples are less susceptible to protease inhibitors (Abecasis et al., 2006). Group O viruses are known to be naturally resistant to non-nucleoside reverse transcriptase inhibitors. The M group viruses are shown to have similar susceptibility to currently used drugs, at least in vitro (Abecasis et al., 2006; Palmer et al., 1998) but some groups reported that subtype B is different from other subtypes in the generation of drug resistance mutations under treatment selective pressure (Abecasis et al., 2005; Grossman et al., 2004; Pieniazek et al., 2000).

Recently, classification of HIV-1 sequences became more based on online subtyping tools than on manual phylogenetic analysis. At present, several web-applications are available: the NCBI genotyping program (http://www.ncbi.nih.gov/projects/genotyping/form/page.cgi), the Los Alamos RIP program (http://hivweb.lanl.gov/RIP/RIPsubmit.html), the Stanford HIV-seq program (http://hivdb.Stanford.edu), the STAR subtype analyser (http://www.biochem.ucl.ac.uk/bsm/virus_database/) and the REGA subtyping tool (http://www.bioafrica.net/subtypetool/html/). The NCBI genotyping program is based on a BLAST-based sliding window approach. The Los Alamos RIP program uses a sliding window approach based on similarity distance measurements. With the Stanford HIV-seq software, the subtype of the most similar reference sequence, in protease (PR) and reverse transcriptase (RT) separately, is assigned to the query sequence. The STAR subtype analyser uses position-specific scoring matrix (PSSM)-based genotyping. Finally, the REGA HIV-1 subtyping tool is the only one to use a phylogeny-based subtyping method (de Oliveira et al., 2005; Gale et al., 2004; Korber et al., 2002; Rozanov et al., 2004).

The first version of the REGA subtyping tool was based on a streamline scheme. Two Neighbour-Joining (NJ) trees of the query with two sets of pre-selected reference sequences, one only with the pure subtype reference sequences and the other containing both pure subtypes and CRFs, were constructed sequentially. After building each tree, bootstrap testing (100 replicates) was used to test the reliability of the tree clustering. Finally, bootscanning analysis and likelihood mapping analysis were used to test recombination and phylogenetic signal (Salminen et al., 1995; Strimmer and von Haeseler, 1997). A bootstrap value of 70% was used as cut-off value in the preceding two trees, for the assignment of the query sequence either to a particular pure subtype reference or to a CRF reference (de Oliveira et al., 2005).

Version 2, on the other hand, follows a Decision-Tree-based algorithm, as presented in Fig. 1. At the first step, a pure NJ tree is built, containing the query and reference sequences of the so called ‘pure’ subtypes. Depending on the bootstrap value, a different branch of the tree is followed. A bootstrap value of 70 is used as threshold for the split decision between branches of the Decision-Tree. Then, the bootscan method is applied and the bootscan support (defined as the fraction of windows in which the sequence clusters with the more frequently supported subtype with a bootstrap support above 70; threshold = 0.9) is now used as split decision criterion. If the bootscanning procedure with only pure subtypes has a bootscan support <0.9, a NJ tree with only CRFs is made. Subsequently the CRF clustering with the query in the CRFs tree is added to the ‘pure’ subtype references and the bootscan method is applied (see Fig. 1). The final result provides the genome subtype pattern schema, the phylogenetic signal based on likelihood mapping analysis, the alignment(s), the tree(s) and the bootscan plot(s). The philosophy of the tool is to assign only when confident. This results in the majority of sequences being assigned a subtype, and where not possible, being flagged for further verification by manual phylogenetic procedures. In this context, some sequences remain unassigned, compared to other subtyping tools, however we see this as an advantage rather than a disadvantage as reported by Holguin et al. (2008), who treated such sequences as wrong assignments, thereby claiming the REGA tool to be unreliable.
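The bootscan support used as split criterion can be illustrated with a minimal Python sketch. It assumes that each sliding window has already been reduced to a pair (best-clustering subtype, bootstrap value); the function names, the input format and the returned labels are hypothetical and are not taken from the REGA tool itself. Only the two thresholds (70 and 0.9) come from the text.

```python
from collections import Counter

BOOTSTRAP_CUTOFF = 70   # per-window bootstrap threshold quoted in the text
BOOTSCAN_CUTOFF = 0.9   # bootscan-support threshold quoted in the text


def bootscan_support(windows):
    """Return (subtype, support) for a list of (subtype, bootstrap) windows.

    Support is the fraction of all windows in which the query clusters
    with the most frequently supported subtype with bootstrap above 70.
    """
    if not windows:
        return None, 0.0
    supported = Counter(s for s, b in windows if b > BOOTSTRAP_CUTOFF)
    if not supported:
        return None, 0.0
    subtype, count = supported.most_common(1)[0]
    return subtype, count / len(windows)


def next_step(windows):
    """Hypothetical labels for the branch taken after the pure-subtype bootscan."""
    subtype, support = bootscan_support(windows)
    if support < BOOTSCAN_CUTOFF:
        return "build CRF tree"          # support < 0.9: continue with CRFs
    return f"no recombination signal ({subtype})"


# Example: 9 of 10 windows support subtype B above the bootstrap cutoff.
example = [("B", 95)] * 9 + [("D", 80)]
print(bootscan_support(example), next_step(example))
```

The sketch covers only the first split of the Decision-Tree; the later branches (rules 1A to 8 in Fig. 1) depend on tree topology checks that are not modelled here.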

In this paper we wanted to compare the two versions of the REGA subtyping tool on a dataset of 4676 pol sequences to assess the improvements of version 2 and to verify whether good features of version 1 were lost during updating the tool that need to be recovered for the next version. We compared those results to a manual approach, used in previous studies. Such comparisons are very valuable to further improve the tool for a version 3 release.

2. Materials and methods

The dataset we used contained 4676 pol sequences derived from patients at the Egas Moniz Hospital, Lisbon, submitted to resistance testing either for therapy failure or for baseline genotyping of drug-naïve patients. Data was retrieved from the Egas Moniz RegaDB instance. The sequences were obtained by population sequencing using the ViroSeq 2.0 toolkit (Abbott Laboratories, Abbott Park, IL, USA). The sequences were in general approximately 1300 bp long (Min = 993 bp; Max = 1311 bp; Average = 1295 bp).

HIV-1 pol subtyping was first performed using REGA HIV subtyping tool version 1 and version 2, separately. The results for version 2 were obtained by running the program 10 times and combining the outcome from each run by occurrence. This was necessary because, owing to the random nature of the bootstrap procedure and the high similarity between some subtypes, the use of a bootscan value of 90 as split criterion does not generate consistent assignments of these subtypes. Version 1, on the other hand, does not use the bootscan value as subtyping criterion, therefore its results are highly consistent.
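How the 10 runs might be combined "by occurrence" can be sketched as follows. The text does not specify the exact combination rule, so taking the most frequent assignment is an assumption made here for illustration, and the function name and label strings are hypothetical.

```python
from collections import Counter


def combine_runs(assignments):
    """Combine repeated assignments for one sequence.

    Assumption: the most frequent outcome across runs is retained, and the
    fraction of runs agreeing with it is reported as a reproducibility hint.
    """
    counts = Counter(assignments)
    label, n = counts.most_common(1)[0]
    return label, n / len(assignments)


# Example: a sequence assigned to subtype B in 7 of 10 runs.
runs = ["B"] * 7 + ["Unassigned"] * 3
print(combine_runs(runs))
```

Reporting the agreeing fraction alongside the label makes it easy to flag sequences whose assignment changes between runs.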

We then performed manual analysis as follows:

(a) The sequences that were classified as ‘unassigned’ by both versions, or that gave discordant results between the two versions, were subtyped manually by phylogenetic analysis. 423 sequences were unassigned by both, and 68 were discordant between both (Table 1). The programs CLUSTALW (Thompson et al., 1994), Simplot (Lole et al., 1999) and PAUP (Swofford, 1998) were used for alignment, recombination detection and construction of the maximum likelihood tree, respectively. For measuring the reliability of the clustering, 1000 bootstrap replicates were run in each version. In the situation that the query was clearly assigned by phylogenetic analysis and not assigned by the subtyping tools, we recorded these results as false negatives for both tools.

(b) The concordant assignments of sequences by both tools, in total 4195 sequences, were verified first by bootscanning analysis with our new reference set, a set of carefully chosen reference sequences downloaded from the Los Alamos HIV sequence database (Los Alamos National Laboratory, 2005). If the bootscan plots showed an apparent crossover between two or more references, a manual phylogenetic analysis was performed. The assignment was considered as false positive if the manual analysis indicated either that the sequence was unassignable to any subtype or CRF, or a different assignment than the one indicated by the subtyping tool. Otherwise, if the result of the manual analysis was consistent with the subtyping tool results, we considered the assignment as a true positive and no further manual analysis was performed.
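The bookkeeping described in (a) and (b) can be sketched as a small Python function that compares one tool call with the manual call. The label "Unassigned" and the function name are illustrative assumptions; the four outcomes follow the definitions used in this study.

```python
UNASSIGNED = "Unassigned"  # hypothetical label for sequences left unassigned


def classify(tool_call, manual_call):
    """Return TP, TN, FP or FN for one sequence, with manual analysis as reference."""
    if tool_call != UNASSIGNED:
        # Assigned by the tool: correct only if manual analysis agrees.
        return "TP" if tool_call == manual_call else "FP"
    # Left unassigned by the tool: an error only if manual analysis assigns it.
    return "FN" if manual_call != UNASSIGNED else "TN"


print(classify("B", "B"))                 # TP
print(classify("D", UNASSIGNED))          # FP
print(classify(UNASSIGNED, "G"))          # FN
print(classify(UNASSIGNED, UNASSIGNED))   # TN
```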

The final results were summarized in terms of specificity, sensitivity and reproducibility. True positive and true negative for each version are defined as the number of queries that get concordant results from manual analysis and the tools, either assigning the sequence to one subtype (true positive) or leaving it unassigned (true negative). False positives and false negatives for each version are the number of queries that get discordant results from manual analysis and each version, respectively. A false positive is a sequence that is incorrectly assigned to one subtype by the tool, while a false negative is a sequence that is incorrectly left as unassigned by the tool. Sensitivity is defined as: number of true positives/(number of true positives + number of false negatives); specificity is defined as: number of true negatives/(number of true negatives + number of false positives). These measures were summarized both for all subtypes together and in a subtype-specific way. By doing this, we expect to discriminate between subtypes/CRFs that the tools are more/less efficient to genotype.

Fig. 1. Decision-Tree algorithm of the REGA subtyping tool used to subtype query sequences longer than 800 bp. A brief description of the algorithm is given in Section 1.

Rule 1A: pure. Subtype assigned based on sequence >800 bp, clustering with a pure subtype with bootstrap >70% without recombination in the bootscan, and not clustering with a CRF with bootstrap >70%.

Rule 1B: pure. Subtype assigned based on sequence >800 bp, clustering with a pure subtype with bootstrap >70% without recombination in the bootscan, clustering with a CRF with bootstrap >70% however not clustering inside the CRF cluster.

Rule 1C: pure (CRF). Subtype assigned based on sequence >800 bp, clustering with a pure subtype with bootstrap >70% without recombination in the bootscan, clustering with a CRF with bootstrap >70% and clustering inside the pure subtype cluster.

Rule 2: check the bootscan. Subtype unassigned based on sequence >800 bp, clustering with a pure subtype with bootstrap >70%, with detection of recombination in the pure subtype bootscan, and failure to classify as CRFs (bootstrap support).

Rule 3: check the bootscan. Subtype unassigned based on sequence >800 bp, clustering with a pure subtype and CRF with bootstrap >70%, with detection of recombination in the pure subtype bootscan, and failure to classify as CRFs (bootscan support).

Rule 4: CRF. Subtype assigned based on sequence >800 bp, clustering with a CRF with bootstrap >70%, with detection of recombination in the pure subtype bootscan, and further confirmed as a CRF by bootscan analysis.

Rule 5: check the report. Subtype unassigned based on sequence >800 bp, not clustering with a pure subtype with bootstrap >70%, with no detection of recombination in the pure subtype bootscan.

Rule 6: check the bootscan. Subtype unassigned based on sequence >800 bp, not clustering with a pure subtype or CRF with bootstrap >70%, with detection of recombination.

Rule 7: check the bootscan. Subtype unassigned based on sequence >800 bp, clustering with a CRF with bootstrap >70%, with detection of recombination in the pure subtype bootscan, and failure to classify as CRFs (bootscan support).

Rule 8: CRF. Subtype assigned based on sequence >800 bp, clustering with a CRF with bootstrap >70%, with detection of recombination in the pure subtype bootscan, and further confirmed as a CRF by bootscan analysis. http://www.bioafrica.net/rega-genotype/html/subtypedecisiontree.html.

Table 1. Summary of the results of each method: subtyping tool versions 1, 2 and manual phylogenetic analysis. In general, version 2 assigns more sequences than version 1 for all the subtypes and CRFs, especially in subtype B.

              Version 1   Version 2   Manual
Subtype A     62          65          66
Subtype B     2125        2149        2152
Subtype C     153         153         153
Subtype D     10          14          14
Subtype F     90          91          91
Subtype G     1545        1555        1575
Subtype H     6           6           7
Subtype J     2           2           2
CRF02_AG      175         177         175
CRF06_CPX     25          39          35
CRF13_CPX     2           2           2
Unassigned    481         423         404
Total         4676        4676        4676

Table 2. Performance of the REGA subtyping tool versions 1 and 2. Results are summarized both subtype-specific wise and for all subtypes.

              Subtyping tool v1                                      Subtyping tool v2
              TP (n)  TN (n)  FP (n)  FN (n)  Sens (%)  Spec (%)     TP (n)  TN (n)  FP (n)  FN (n)  Sens (%)  Spec (%)
A             62      4610    0       4       93.9      100          65      4610    0       1       98.5      100
B             2116    2506    9       36      98.3      99.6         2139    2514    10      13      99.4      99.5
C             153     4523    0       0       100       100          153     4523    0       0       100       100
D             8       4660    2       6       57.1      99.9         12      4660    2       2       85.7      99.9
F             88      4581    2       3       96.7      99.9         89      4581    2       2       97.8      99.9
G (&14_BG)    1545    3101    0       30      98.1      100          1555    3101    0       20      98.7      100
H             6       4669    0       1       85.7      100          6       4669    0       1       85.7      100
J             2       4674    0       0       100       100          2       4674    0       0       100       100
02_AG         173     4499    2       2       98.8      99.9         173     4499    2       2       98.8      99.9
06_CPX        15      4621    10      20      42.9      99.7         27      4617    12      8       77.1      99.7
13_CPX        2       4674    0       0       100       100          2       4674    0       0       100       100
Total         4170    379     25      102     97.6      93.8         4223    374     30      49      98.9      92.6

Since there is no breaking point in the pol region of CRF14_BG, no sequences were assigned by rule 1C (which takes into account inside/outside clustering), we consider subtype G and CRF14_BG both as subtype G. True positive and true negative are defined for each tool as the number of queries that get concordant results from manual analysis and subtyping tool either assigned or unassigned, respectively. False positive and false negative for each version of the subtyping tool are the number of queries that get the discordant results from manual analysis and subtyping tool either unassigned by the tool and assigned manually (false negative) or assigned by the tool and unassigned manually (false positive). Sensitivity is defined as: number of true positives/(number of true positives + number of false negatives); specificity is defined as: number of true negatives/(number of true negatives + number of false positives). TP: true positives; TN: true negatives; FP: false positives; FN: false negatives; Sens: sensitivity; Spec: specificity.

3. Results

In this study, we aimed to compare the performance of the version 2.0 to the original one, so we used reference sequences from the same pure subtypes and CRFs: A–D, F, G, H, J, and K and CRF01-14. We detected more than a hundred new URFs and other CRFs, which have already been published but were not included in the reference set, in which case we consider the lack of assignment by the subtyping tool as true negative.

The discordant results between these two versions are mainly because version 1 failed to classify 58 sequences, whereas version 2 assigned most of the sequences. Upon manually verifying the assignment from version 2, we found that most of the unassigned sequences by version 1 were subtype B, in total 24 sequences. Among these discordant results, one sequence was false positively assigned to subtype B by version 2. Furthermore, 10, 2, 14, 4, 3 and 1 sequences were correctly assigned to subtype G, CRF02_AG, CRF06_CPX, subtype D, subtype A and subtype F, respectively. Finally, 2 sequences were misassigned to CRF06_CPX and CRF02_AG by version 2. The result shows that version 1 is more conservative than version 2, presenting fewer false positives, but on the other hand many more false negatives than version 2.

For the concordant assignments, we found that the misassignments concentrate on complex recombinant CRF06_CPX, in which case the results of the phylogenetic analysis show that the query sequences seem to be new recombinant forms between CRF02_AG and CRF06_CPX. In this situation, both versions have the same problem.

Table 3. Summary of the advantages and disadvantages of the REGA subtyping tool version 1 compared to version 2.

            Advantages                                    Disadvantages
Version 1   More consistent (no need to address           Lower sensitivity.
            reproducibility).                             More computationally intensive (slower).
            Higher specificity for subtype B.
Version 2   Higher sensitivity.                           Lower specificity for subtype B.
            Less computationally intensive (faster).      Lower reproducibility.

Table 2 shows the details of the performance of each version for each subtype and CRF. For subtype B, there are 9 wrong assignments for version 1 and 10 for version 2. The frequent crossover between subtypes B and D, due to the close relatedness of these subtypes occurring at the middle region of the pol sequences, is a source of difficulty in assigning sequences to subtype B. There were no misassignments for subtypes A, C, G, H and CRF13_CPX. But both versions concordantly gave 2, 2, 2 and 10 wrong assignments to sequences of subtype D, subtype F, CRF02_AG and CRF06_CPX, respectively. Version 2 also gave 2 extra misassignments to the latter 2 CRFs resulting in the assignment of 58 discordant sequences. Phylogenetic analysis showed that these two sequences wrongly assigned to ‘subtype D’ were actually recombinants between subtypes B and D, and those misassigned to subtype F were recombinants between subtypes B and F. Again, the wrong assignments for CRF02_AG and CRF06_CPX resulted from the fact that the queries were recombinants of CRF02_AG and CRF06_CPX.

For sequences which were not assigned by both tools, manual analysis further assigned 23 subtype Bs, 20 subtype Gs, 2 subtype Ds, 1 subtype A, 2 subtype Fs, 1 subtype H, and 8 CRF06_CPXs. All of the 23 subtype Bs have the same pattern of crossover between subtypes B and D at the middle region of the pol sequences. We found that, in general, if more than two reference subtypes are very closely related, then during bootscan analysis they will compete with one another in the bootstrap procedure. Thus, the query could not get a consistent support to any of them.

Our analysis shows that version 2 performs generally better than version 1, especially in terms of sensitivity (see Table 2). Furthermore, we recorded the running time for both versions, and version 2 is faster than version 1. Subtyping 4676 sequences by version 1 takes more than 50 h, whereas version 2 gave the results in 26 h on the same computer. The main pitfall of version 2 seems to be the specificity in the assignment of subtype B. In this subtype, there was a small decrease of specificity in version 2 when compared to version 1 (Table 2).
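As a worked check of the overall figures, the sensitivity and specificity definitions from Section 2 can be applied to the "Total" row of Table 2. The short Python sketch below uses only the counts reported in that row; the function and variable names are illustrative.

```python
def sensitivity(tp, fn):
    """Percentage of manually assignable sequences that the tool assigned correctly."""
    return 100.0 * tp / (tp + fn)


def specificity(tn, fp):
    """Percentage computed as TN / (TN + FP), as defined in Section 2."""
    return 100.0 * tn / (tn + fp)


# Counts from the "Total" row of Table 2.
totals = {
    "v1": {"tp": 4170, "tn": 379, "fp": 25, "fn": 102},
    "v2": {"tp": 4223, "tn": 374, "fp": 30, "fn": 49},
}

for version, c in totals.items():
    sens = round(sensitivity(c["tp"], c["fn"]), 1)
    spec = round(specificity(c["tn"], c["fp"]), 1)
    print(version, sens, spec)
# v1 97.6 93.8
# v2 98.9 92.6
```

The same counts give the differences quoted in the abstract: 102 - 49 = 53 fewer false negatives and 30 - 25 = 5 more false positives in version 2.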

A summary of the advantages and disadvantages of the two versions of the subtyping tool is presented in Table 3.

4. Discussion

The REGA subtyping tool version 1, based on a streamline procedure, is more computationally intensive and time consuming, because for any query sequence all the same procedures have to be done before reaching the final result. This takes some unnecessary time on one simple case and means a lot more time when datasets of thousands of sequences are analysed. Subtyping tool version 2, in contrast, is based on a Decision-Tree model that allows some sequences to be subtyped early in the Decision-Tree process, making the procedure faster in these cases. This feature makes this version faster than version 1.

While a low false positive rate is assured in version 1, this is achieved at the price of a high number of false negative results. During the verification step of version 1, using the bootscan method, any kind of conflict signal will be considered as uncertainty thus leaving the query unassigned. On the other hand, version 2 uses bootscan value to check whether that conflict signal is strong enough to prevent assigning the query. This procedure resulted in 481 sequences that were unassigned in version 1, while this number was decreased to 423 sequences in version 2, only at the cost of 5 extra false positive results in version 2 (Table 1).

Concerning the assignment of ‘pure’ subtype sequences, the main difficulty of the tool, in both versions, was to assign subtype B sequences. However, this can be explained by the high similarity between subtypes B and D and by the frequent occurrence of crossover between these two subtypes in the beginning of RT. This also accounts for the high number of false negatives in subtype G, due to its history of recombination with subtype A (Abecasis et al., 2007). These problems will have to be resolved in the next version of the subtyping tool.

The classification of CRF sequences might be difficult, especially if no recombination breakpoints are included in the analysed sequence. For example, our dataset only contained pol sequences and CRF01_AE and CRF14_BG do not have recombination breakpoints in this region. However, rule 1C is expected to take this into account, by verifying the inside/outside clustering of putative sequences of these CRFs with CRF reference sequences. This procedure assigns a CRF even in the absence of breakpoints, if it clusters significantly within the cluster of reference CRF sequences, since these reference sequences were chosen only when a full genome is available thus assuring the CRF assignment. In the case of CRF14_BG sequences of our dataset, no sequences were assigned by rule 1C but by rules 4 and 8 that do not verify the inside/outside clustering of the query. Therefore, the correctness of this assignment is dubious and this is why we considered sequences assigned to CRF14_BG as subtype G.

Both versions of the subtyping tool try to avoid the difficulties of determining the breaking point of recombinant sequences by using CRF references. However, currently only CRF01-14 reference sequences are included in the reference set of the tool, therefore all other CRFs (15–43) cannot be assigned by the tool. This is one of the reasons why there is such a large number of unassigned sequences. Manual phylogenetic analysis assigned 30 BF recombinants, and some of them have already been published in the Los Alamos sequence database as a new CRF. As the number of CRFs continues to increase, it becomes infeasible to continue to include all these CRF reference sequences in the reference set. One approach could be that the subtyping tool assigns only based on pure subtype references, and after having determined the breaking point of the recombinant, a true phylogenetic relationship could be inferred for each segment. However, this procedure probably also represents much more computation time. Another approach could be to verify which CRF is a reliably assigned CRF and plays a substantial role in the epidemic, leaving out those CRFs that are exceedingly rare or are disappearing from the epidemic. This latter approach is currently being investigated and we are preparing a publication assessing the relevance of assigning particular CRFs by an automated subtyping tool.

Although the performance of the subtyping tool, especially version 2, was good compared to manual assignment, we believe that improvements could be achieved by changing the criteria rule. In the situation of two closely related subtypes (for example subtypes B and D, or subtypes G and A), when a pure subtype query sequence is taken through the first two steps of the procedure (Fig. 1) and appears to be related to two or more subtypes, then the rule used to discriminate the true recombinant from a pure subtype with noise from the other subtype references (for example, subtype B with noise from subtype D) could be applied with a less stringent bootscan value (currently 90%). Otherwise, the query will go through the rest of the Decision-Tree and remain unassigned. The underlying reason is that when two or more references related to the query are very closely related in one of the analysed genomic regions, the bootstrap replicates will sometimes support one of the references and other times the other; therefore, none of these references will get a support above 70% and the bootscan value will be lower than 90%. This region will thus be labelled as unassigned. Optimally, a set of references that are as divergent as possible could reduce this difficulty to some extent. But due to the nature of some subtypes, such as B and D, which are not as divergent as the others, and A and G, which have a history of recombination events (subtype G being in fact a CRF with CRF02_AG as parent instead of the other way round; Abecasis et al., 2007), it will be very difficult to fine-tune the criteria without introducing too many false positive assignments. For all the reasons mentioned above, we might have to treat different subtypes differently in our Decision-Tree algorithm.
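The effect described above can be illustrated with a minimal sketch. It assumes that the bootscan value is the percentage of sliding windows in which a reference reaches a bootstrap support above 70%, and that a region is assigned only when that value reaches 90%. The window data, the function names and the exact form of the comparisons are hypothetical and are not taken from the implementation of the tool.

```python
# Illustrative sketch, not the REGA tool's implementation.
# The cutoffs (70 and 90) come from the text; everything else is assumed.
BOOTSTRAP_CUTOFF = 70.0  # minimum bootstrap support for a window to count
BOOTSCAN_CUTOFF = 90.0   # minimum percentage of supporting windows


def supported_reference(window):
    """Return the reference whose support exceeds the bootstrap cutoff, or None."""
    best = max(window, key=window.get)
    return best if window[best] > BOOTSTRAP_CUTOFF else None


def bootscan_value(windows, subtype):
    """Percentage of windows in which `subtype` is the supported reference."""
    hits = sum(1 for w in windows if supported_reference(w) == subtype)
    return 100.0 * hits / len(windows)


def label_region(windows, subtype):
    """Assign `subtype` only if its bootscan value reaches the cutoff."""
    if bootscan_value(windows, subtype) >= BOOTSCAN_CUTOFF:
        return subtype
    return "unassigned"


# Hypothetical bootstrap supports (%) per window for a pure subtype B query.
clean = [{"B": 95.0, "D": 3.0}] * 10
# Same query, but in three windows the replicates split between B and D.
noisy = [{"B": 92.0, "D": 5.0}] * 7 + [{"B": 55.0, "D": 43.0}] * 3

print(label_region(clean, "B"))    # B
print(bootscan_value(noisy, "B"))  # 70.0
print(label_region(noisy, "B"))    # unassigned
```

In this toy example, lowering the hypothetical BOOTSCAN_CUTOFF would assign the noisy query to subtype B, which corresponds to the less stringent bootscan value suggested in the text, at the risk of also accepting true B/D recombinants.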

The differences implemented in version 2 compared to version 1 give the tool the strength to detect more pure subtype B sequences with about the same false positive rate. However, the discrimination between subtype B and B/D recombinants remains the weak part of the subtyping tool, especially since the majority of sequences submitted to the tool are still subtype B. This is partly why version 2 has to be run several times for a safer assignment. The other reason is that our dataset contains sequences from the pol region of the HIV-1 genome, where the phylogenetic signal is relatively low. Also, no breakpoint exists in this region for CRF14_BG and CRF01_AE. CRF14_BG and CRF01_AE are therefore highly similar to the references of subtypes G and A, even though this problem is taken into account in version 2 by using the inside/outside clustering rule.

Finally, the correct identification of CRF06_CPX is hindered by recombinants between CRFs. In our dataset, we identified 17 new recombinants between CRF02_AG and CRF06_CPX, which have not been published in Los Alamos. Currently used programs arbitrarily assign these cases to CRF02 or CRF06. This is because the REGA subtyping tool never includes two CRF reference sequences in the same bootscanning plot; recombinants between two CRFs are therefore impossible to detect. This problem will also have to be solved in version 3 of the tool.

In this paper, we compared the performance of the two versions of the REGA subtyping tool. Version 2 showed 98.9% and 92.6% sensitivity and specificity, respectively, versus 97.6% and 93.8% for version 1, indicating an increase in sensitivity at the cost of a decrease in specificity. The overall decrease in specificity of version 2, especially towards subtype B, calls for caution in the interpretation of the results provided by this tool. New algorithms for solving this problem should be tested. Changing the reference strains might be a first step; however, changes in the Decision-Tree will also be necessary. With the growth in the number of recognized CRFs, careful consideration will be needed to decide whether and which CRFs are of sufficient importance to be included in a subtyping tool, knowing that recombination always complicates assignments. Furthermore, increasing the number of bootstrap replicates should solve the infrequent inconsistent assignments (especially to subtype B).
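As a rough aid to reading these figures, the sketch below shows how sensitivity and specificity are computed from confusion counts and how the reported percentages of the two versions differ. The counts passed to the two functions are hypothetical placeholders, not values from this study; only the four percentages are taken from the text.

```python
def sensitivity(tp, fn):
    """Fraction of true positives among all actual positives."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true negatives among all actual negatives."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only (not data from this study).
print(sensitivity(tp=90, fn=10))  # 0.9
print(specificity(tn=80, fp=20))  # 0.8

# Percentages reported in the text: (sensitivity, specificity).
REPORTED = {"version 1": (97.6, 93.8), "version 2": (98.9, 92.6)}

# Differences in percentage points between the two versions.
sens_gain = round(REPORTED["version 2"][0] - REPORTED["version 1"][0], 1)
spec_loss = round(REPORTED["version 1"][1] - REPORTED["version 2"][1], 1)
print(sens_gain, spec_loss)  # 1.3 1.2
```

Under these figures, version 2 gains 1.3 percentage points in sensitivity and loses 1.2 percentage points in specificity relative to version 1.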

Acknowledgements

ABA was supported by a PhD grant from the Fundacao para a Ciencia e Tecnologia (FCT). This work was partially supported by FWO grant (G.0611.09), by the programme for Interuniversitaire Attractiepolen (IUAP nr P6/41) and by the European Commission (EC grant CHAIN 7FP, 223131). The authors are grateful for the training received at the 14th International Bioinformatics Workshop on Virus Evolution and Molecular Epidemiology, September 2008, Cape Town (http://www.rega.kuleuven.be/cev/workshop/).

References

Abecasis, A.B., Deforche, K., Bacheler, L.T., McKenna, P., Carvalho, A.P., Gomes, P., Vandamme, A.M., Camacho, R.J., 2006. Investigation of baseline susceptibility to protease inhibitors in HIV-1 subtypes C, F, G and CRF02_AG. Antivir. Ther. 11, 581–589.

Abecasis, A.B., Deforche, K., Snoeck, J., Bacheler, L.T., McKenna, P., Carvalho, A.P., Gomes, P., Camacho, R.J., Vandamme, A.M., 2005. Protease mutation M89I/V is linked to therapy failure in patients infected with the HIV-1 non-B subtypes C, F or G. AIDS 19, 1799–1806.

Abecasis, A.B., Lemey, P., Vidal, N., de Oliveira, T., Peeters, M., Camacho, R., Shapiro, B., Rambaut, A., Vandamme, A.M., 2007. Recombination confounds the early evolutionary history of human immunodeficiency virus type 1: subtype G is a circulating recombinant form. J. Virol. 81, 8543–8551.

Abecasis, A.B., Wensing, A.M.J., Vercauteren, J., Paraskevis, D., van de Vijver, D.A., Albert, J., Asjo, B., Balotta, C., Bruckova, M., Camacho, R., Coughlan, S., Grossman, Z., Hamouda, O., Hatzakis, A., Horban, A., Korn, K., Kostrikis, L., Kucherer, C., Nielsen, C., Poljak, M., Puchhammer-Stockl, E., Riva, C., Ruiz, L., Salminen, M., Schmit, J.C., Schuurman, R., Sonnerborg, A., Stanekova, D., Stanojevic, M., Struck, D., Boucher, C.A.B., Vandamme, A.M., on behalf of the SPREAD-programme, 2008. Demographic determinants of HIV-1 subtype distribution in Europe. In: 6th European HIV Drug Resistance Workshop.

Baeten, J.M., Chohan, B., Lavreys, L., Chohan, V., McClelland, R.S., Certain, L., Mandaliya, K., Jaoko, W., Overbaugh, J., 2007. HIV-1 subtype D infection is associated with faster disease progression than subtype A in spite of similar plasma HIV-1 loads. J. Infect. Dis. 195, 1177–1180.

Buonaguro, L., Tornesello, M.L., Buonaguro, F.M., 2007. Human immunodeficiency virus type 1 subtype distribution in the worldwide epidemic: pathogenetic and therapeutic implications. J. Virol. 81, 10209–10219.

de Oliveira, T., Deforche, K., Cassol, S., Salminen, M., Paraskevis, D., Seebregts, C., Snoeck, J., van Rensburg, E.J., Wensing, A.M., van de Vijver, D.A., Boucher, C.A., Camacho, R., Vandamme, A.M., 2005. An automated genotyping system for analysis of HIV-1 and other microbial sequences. Bioinformatics 21, 3797–3800.

Gale, C.V., Myers, R., Tedder, R.S., Williams, I.G., Kellam, P., 2004. Development of a novel human immunodeficiency virus type 1 subtyping tool, subtype analyzer (STAR): analysis of subtype distribution in London. AIDS Res. Hum. Retroviruses 20, 457–464.

Grossman, Z., Paxinos, E.E., Averbuch, D., Maayan, S., Parkin, N.T., Engelhard, D., Lorber, M., Istomin, V., Shaked, Y., Mendelson, E., Ram, D., Petropoulos, C.J., Schapiro, J.M., 2004. Mutation D30N is not preferentially selected by human immunodeficiency virus type 1 subtype C in the development of resistance to nelfinavir. Antimicrob. Agents Chemother. 48, 2159–2165.

Hemelaar, J., Gouws, E., Ghys, P.D., Osmanov, S., 2006. Global and regional distribution of HIV-1 genetic subtypes and recombinants in 2004. AIDS 20, W13–W23.

Holguin, A., Lopez, M., Soriano, V., 2008. Reliability of rapid subtyping tools compared to that of phylogenetic analysis for characterization of human immunodeficiency virus type 1 non-B subtypes and recombinant forms. J. Clin. Microbiol. 46, 3896–3899.

Korber, B.T.M., Brander, C., Haynes, B.F., Koup, R., Kuiken, C., Moore, J.P., Walker, B.D., Watkins, D.I., 2002. HIV Molecular Immunology. Los Alamos National Laboratory, Theoretical Biology and Biophysics, Los Alamos, New Mexico.

Lole, K.S., Bollinger, R.C., Paranjape, R.S., Gadkari, D., Kulkarni, S.S., Novak, N.G., Ingersoll, R., Sheppard, H.W., Ray, S.C., 1999. Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination. J. Virol. 73, 152–160.

Los Alamos National Laboratory, 2005. Los Alamos Database. http://www.hiv.lanl.gov/content/index.

Osmanov, S., Pattou, C., Walker, N., Schwardlander, B., Esparza, J., 2002. Estimated global distribution and regional spread of HIV-1 genetic subtypes in the year 2000. J. Acquir. Immune Defic. Syndr. 29, 184–190.

Palmer, S., Alaeus, A., Albert, J., Cox, S., 1998. Drug susceptibility of subtypes A, B, C, D, and E human immunodeficiency virus type 1 primary isolates. AIDS Res. Hum. Retroviruses 14, 157–162.

Pieniazek, D., Rayfield, M., Hu, D.J., Nkengasong, J., Wiktor, S.Z., Downing, R., Biryahwaho, B., Mastro, T., Tanuri, A., Soriano, V., Lal, R., Dondero, T., HIV Variant Working Group, 2000. Protease sequences from HIV-1 group M subtypes A–H reveal distinct amino acid mutation patterns associated with protease resistance in protease inhibitor-naive individuals worldwide. AIDS 14, 1489–1495.

Robertson, D.L., Anderson, J.P., Bradac, J.A., Carr, J.K., Foley, B., Funkhouser, R.K., Gao, F., Hahn, B.H., Kalish, M.L., Kuiken, C., Learn, G.H., Leitner, T., McCutchan, F., Osmanov, S., Peeters, M., Pieniazek, D., Salminen, M., Sharp, P.M., Wolinsky, S., Korber, B., 2000. HIV-1 nomenclature proposal. Science 288, 55–56.

Rozanov, M., Plikat, U., Chappey, C., Kochergin, A., Tatusova, T., 2004. A web-based genotyping resource for viral sequences. Nucleic Acids Res. 32, W654–W659.

Salminen, M.O., Carr, J.K., Burke, D.S., McCutchan, F.E., 1995. Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning. AIDS Res. Hum. Retroviruses 11, 1423–1425.

Strimmer, K., von Haeseler, A., 1997. Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proc. Natl. Acad. Sci. U.S.A. 94, 6815–6819.

Swofford, D., 1998. PAUP* 4.0-Phylogenetic Analysis Using Parsimony (* and Other Methods). Sinauer Associates, Sunderland, MA.

Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.