
* [email protected]; phone +41 61 697 5915; fax +41 61 697 7244; http://www.genedata.com; GeneDataAG, Maulbeerstrasse 46, CH-4058 Basel, Switzerland

* [email protected]; phone +41 61 697 5915; fax +41 61 697 7244; http://www.genedata.com; GeneData AG, Maulbeerstrasse 46, CH-4058 Basel, Switzerland

ABSTRACT

High-Throughput Screening (HTS) data in its entirety is valuable raw material for the drug-discovery process. It provides the most complete information about the biological activity of a company's compounds. However, its quantity, complexity and heterogeneity require novel, sophisticated approaches in data analysis. At GeneData, we are developing methods for large-scale, synoptical mining of screening data in a five-step analysis:

(1) Quality Assurance: Checking data for experimental artifacts and eliminating low-quality data.
(2) Biological Profiling: Clustering and ranking of compounds based on their biological activity, taking into account specific characteristics of HTS data.
(3) Rule-based Classification: Applying user-defined rules to biological and chemical properties, and providing hypotheses on the biological mode-of-action of compounds.
(4) Joint Biological-Chemical Analysis: Associating chemical compound data to HTS data, providing hypotheses for structure-activity relationships.
(5) Integration with Genomic and Gene Expression Data: Linking into other components of GeneData's bioinformatics platform, and assessing the compounds' modes-of-action, toxicity, and metabolic properties.

These analyses address issues that are crucial for a correct interpretation and full exploitation of screening data. They lead to a sound rating of assays and compounds at an early stage of the lead-finding process.

Keywords: high-throughput screening, multivariate statistical analysis, data quality, synoptical data mining, biological profiling, structure-activity relationship, mode-of-action classification

1. INTRODUCTION

With the combination of robotic methods, genetic engineering, parallel processing and miniaturization of biological in-vitro experiments, HTS entered the scene almost ten years ago 1. It is a method to identify new pharmacologically active compounds on a massive trial-and-error basis. In this respect, HTS opened a new, broad avenue for pharmacological research 2 alongside rational drug design approaches that required known ligand structures (chemical optimization 3,4,5) or crystal structures of target proteins (docking 6,7,8) as starting points. At the same time, HTS advanced the technology for conducting massively parallel, automated, and relatively inexpensive biological experiments. In screening laboratories, testing more than 100,000 compounds a day has become routine 9. Thus, automated mass screening for pharmacologically active compounds is now in widespread use, serving the identification of chemical compounds as starting points for optimization in primary screening, the determination of activity, specificity, physiological and toxicological properties of large sub-libraries (secondary screening), and the verification of structure-activity hypotheses in focused libraries for lead optimization (tertiary screening) 10.

HTS can be considered a massive, purely empirical, probabilistic approach to learning about interactions between a molecular target (protein) and its ligands. It employs a specialized, highly sensitive biological test system (assay) to measure the specific activity of a target in response to perturbations by a set of chemical compounds (the compound library), which are drawn from a more or less wide part of the chemical universe. Although the role of HTS is often still restricted to pure “filtering for strong hits” as starting points for closer investigation in subsequent experiments and, finally, chemical optimization, there are tendencies to make wider use of screening data 11. Several special characteristics of HTS data make it very valuable: Firstly, HTS data represents in its entirety the most complete information about the biological activity of all of a company's compounds. Test results accumulate in a company's database, and most of the compounds, or related compounds, have been measured in more than one biological assay. Some have even been evaluated in secondary and tertiary screens. Secondly, HTS data is, in general, of excellent quality. This is because screening assays are thoroughly revised and trimmed for ultimate sensitivity and stability in a long optimization process involving many scientists and the most modern technology. The automated screening process is designed to guarantee much higher stability of the relevant parameters than found in bench testing.

In the ideal case, HTS would provide the first clues to the chemical properties that are essential for a desired biological activity of compounds 12. Using sophisticated data analysis techniques, HTS results might sum up to a series of “blueprints” (building rules) for chemical structures of desired pharmacological activity and specificity. This would greatly reduce the effort for secondary testing, allow more focused experiments, speed up the whole discovery process, and improve the overall results. However, HTS data is generally very large, structurally heterogeneous, and complex to interpret biologically. Apart from data volume, sophisticated data analysis has to meet at least three challenges:

Firstly, potential measurement errors in HTS data must be quickly identified and excluded from subsequent processing by thorough quality control. This is of utmost importance at every stage of screening and analysis, since even occasional errors may obscure and invalidate the data, given the very low probability of finding real leads among chemically diverse compound sets 13.

Secondly, each screening assay differs from the others in detail, even if they share targets from the same family. The consequence is to include assay-specific characteristics in annotations that can be referred to by automated analysis tools 14, e.g. by multivariate algorithms for the evaluation of bioactivity profiles (see section 2.2).

Thirdly, the data set is very complex. A challenge lies in organizing all this information so that it remains interpretable. In the face of the mass of data produced, we do not believe that the solution lies solely in linking together databases containing all the screening, laboratory testing, and chemical information. The scientists whose joint work builds the “road to the lead” need to communicate efficiently. Only clear-cut analyses at each stage can optimize the process and facilitate such communication. Our approach is to (i) perform thorough quality control to include only reliable data in further processing stages, (ii) standardize, process, condense, and annotate data at each stage as much as needed by the respective experts involved, (iii) keep data related to its context using meta-data on assays, analyses, and compounds, and (iv) provide tools to analyze and interpret large data sets following statistical and biophysical models and standard processes. This makes the overall screening process transparent and traceable.

Sophisticated methods and specialized software help to meet these challenges and open the way for full exploitation of HTS data. Combining results from different assays and screening stages, and taking compound relationships into account, provides a wealth of clues about assay characteristics, possible errors and their sources, compound modes-of-action, and possibly compound structural elements that correlate with the activity. In the short term, this furthers the goal of finding leads quickly. In the long term, HTS as a science is building knowledge about target-ligand interactions and smart, error-tolerant ways to measure and interpret them.

2. ANALYSIS PROCESS AND RESULTS

2.0. DATA SOURCES

In a screening run, microtiter plates containing pre-dispensed test compounds are complemented with the biological test system, necessary substrates, cofactors and controls, and the biological activity is read out after a fixed incubation time. With the pressure to keep processing times short, the most common types of signals in primary screens are single-point, two-point, or flash-kinetics measurements of optical parameters as a readout for an enzymatic reaction or the metabolic state of a cell 9. Usually, this data is reduced in a pre-processing step to one or a few characteristic signals per well. Given a thorough assay optimization beforehand 10, the signal will depend linearly (or at least monotonically) on the activity of the biological target. With fewer compounds to handle in secondary screening, a more detailed experimental investigation is common. It ranges from the measurement of extensive kinetics or small concentration series to panel testing on different biological systems for specificity, toxicity and adsorption information. The high total of individual measurements and controls makes screening amenable to statistical methods despite the persisting rarity of true replicate measurements of compounds in primary screens. For complete monitoring of assay quality and for the application of some statistical normalizations at later stages, it is useful to have controls covering different signal levels present in several wells on every plate. These controls comprise: (i) blank controls, without the biological system, which yield the baseline signal; (ii) neutral controls, without compound, which yield the signal and noise of the undisturbed assay; (iii) maximum-effect controls, a reference agonist or antagonist at saturating concentration, which yield the signal range; and (iv) sensitivity controls, a reference compound at a concentration eliciting a 50% biological response, which yield a measure of assay sensitivity.
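
Where kinetic reads are taken, the pre-processing step that reduces them to one characteristic signal per well can be as simple as a linear fit of the early reaction phase. The following is a minimal sketch under that linearity assumption, not the actual reduction method of any particular screening system; the example data is invented.

```python
import numpy as np

def kinetic_slopes(timepoints: np.ndarray, reads: np.ndarray) -> np.ndarray:
    """Reduce a flash-kinetics measurement to one signal per well.

    timepoints: shape (n_timepoints,), e.g. seconds
    reads:      shape (n_timepoints, n_wells), raw optical readout
    Returns the initial reaction rate per well from a linear least-squares
    fit, assuming the early phase of the reaction is approximately linear.
    """
    slopes, _intercepts = np.polyfit(timepoints, reads, deg=1)
    return slopes

# Example: five reads of a 384-well plate, 2 s apart (invented data)
t = np.arange(5) * 2.0
reads = np.random.default_rng(0).normal(100.0, 1.0, size=(5, 384))
signal_per_well = kinetic_slopes(t, reads)  # shape (384,)
```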

2.1. QUALITY ASSURANCE, NORMALIZATION, CORRECTION

The validity of conclusions based on screening results, e.g. hit selection or pharmacological annotation of active compounds, clearly depends on the quality of the underlying data. Since truly active compounds are usually very rare in primary screening 13, even occasional errors may well obscure the real "hits". If process and measurement errors can be identified as quickly and completely as possible and tracked back to their experimental source, screening data in general provides an excellent basis for the global and detailed qualification of a company's compound set and provides first clues for compound optimization 16.

To guarantee this reliability, data quality control at different levels is a must. This begins in the optimization phase of the assay: In test runs with a small number of compound plates, the assay has to possess a sufficient signal window (e.g., Z-factor 17), stability, and sensitivity (e.g., measured by the effects of known control compounds) 1,10. If problems occur, the parameters of the assay or even its format should be tuned to match the quality criteria of HTS. A much more difficult parameter to adjust is the robustness of an assay against small fluctuations in screening conditions (e.g. temperature) and against disturbances by non-specific compound activity (e.g. scavengers), which are identifiable only later on, by cross-assay comparison or by their characteristic structural elements, and which result in high “hit” rates 15.
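
The Z-factor criterion cited above (ref. 17) is computed directly from control wells. A minimal sketch with invented control values; a value above roughly 0.5 is commonly taken to indicate a sufficient signal window.

```python
import numpy as np

def z_prime(max_effect: np.ndarray, neutral: np.ndarray) -> float:
    """Z'-factor of Zhang et al.: 1 - 3*(sd_p + sd_n) / |mean_p - mean_n|,
    computed from maximum-effect and neutral control wells."""
    spread = 3.0 * (max_effect.std(ddof=1) + neutral.std(ddof=1))
    window = abs(max_effect.mean() - neutral.mean())
    return 1.0 - spread / window

# Invented control signals from one validation plate
max_effect = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3])
neutral = np.array([98.7, 101.2, 99.5, 100.4, 100.9, 99.1])
print(f"Z' = {z_prime(max_effect, neutral):.2f}")
```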

The subsequent screening campaign, in the ideal case, produces a set of data obtained under precisely reproducible conditions from the first plate to the last, using a well-defined and highly sensitive test system. To approach this ideal, two ingredients are needed: data quality control and normalization.

Figure 1: Trend Display of HTS data from a screening campaign, screenshot from GD Screener™. Shown are the trends in absolute signals with time, plate by plate for the first 280 plates. Each vertical greyscale histogram represents the distribution of compound signals from a single plate. Overlaid in yellow are the medians of signals from neutral controls (upper line) and from the blank controls (lower line). Red vertical lines denote run breaks. Large fluctuations of the signals are found. In some cases, the significance of the results was so low that the corresponding plates had to be repeated (red bars below the display).

Data quality control on the level of an individual assay seeks again to guarantee assay stability and sensitivity, which must be monitored constantly using the appropriate controls (Fig. 1). At the same time, it tries to pick up on process artifacts caused by failures in the screening machinery or the test system (e.g. a blocked pipettor needle, air bubbles in the system, a changing metabolic state of reporter cells). If unnoticed, these can result in a high number of false-positive, but seemingly “highly specific hits”. Often, such process artifacts can be detected by changes in the overall signal or by specific “signal patterns” on plates (e.g. pipettor line patterns, Fig. 2), if the compound library is randomized across the screening plates. This analysis is preferably done directly after the screening run to ensure that such patterns can be traced back to their origin (e.g. the pipettor may be inspected the next morning) and can be unambiguously classified as artifacts or non-artifacts. Specialized software supports the scientist who must sift through the large screening data sets. It visualizes data in the context of the complete assay, allowing comparisons across individual screening plates and screening runs with respect to signal, variability, sensitivity and prevalent patterns (Fig. 2). Furthermore, it uses modern statistical methods to automatically identify possible errors and “problem plates” (Fig. 3), and displays them to the scientist. The scientist finally decides, on the basis of his knowledge of the concrete screening experiment, whether to assess them as artifacts and mask the results as invalid, or whether to believe in a true “activity pattern” from a family of closely related compounds.
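
As an illustration of automated pattern flagging, the sketch below scores each plate row by how far its median deviates from the plate median, in robust sigma units; whole rows with large scores hint at an eight-channel pipettor artifact. This is only a simple stand-in for the statistical methods referred to above, and the threshold is an illustrative assumption.

```python
import numpy as np

def row_pattern_scores(plate: np.ndarray) -> np.ndarray:
    """Deviation of each row median from the plate median, in units of a
    robust sigma estimate (median absolute deviation * 1.4826)."""
    med = np.median(plate)
    sigma = 1.4826 * np.median(np.abs(plate - med))
    return (np.median(plate, axis=1) - med) / sigma

def flag_pipettor_pattern(plate: np.ndarray, threshold: float = 2.0) -> bool:
    """Flag a plate as a 'problem plate' if any row deviates strongly."""
    return bool(np.any(np.abs(row_pattern_scores(plate)) > threshold))
```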

Figure 2: Overview of the results of a complete screening assay (769 plates). In the Main Window, each rectangle corresponds to a plate, each colored square to a well of the plate, color-coded according to its normalized signal using the Scale. All values are given as a percentage of the neutral controls. The plate sequence is to be read line by line from left to right. The Plate Lens shows a zoomed view of one of the plates (green circle) exhibiting a line-wise pipettor pattern (rows with higher signal red, with lower signal blue) and a signal gradient from left to right. Masked wells are colored grey. Screenshots from GD Screener™.

Normalization of screening data, the first step in data analysis, ensures the comparability of measurements across plates in an assay (which may easily consist of a thousand plates). It is necessary, since every biological test system exhibits variations over time, and even more across different preparations. Data are normalized on a plate-by-plate basis, since a plate is the processing unit and all wells are affected in the same way by any time trends during the assay. Usually, normalization consists of subtracting the baseline and then dividing the raw signals from the plate's wells by a “reference” signal obtained from the neutral controls. A linear relationship between signal and biological activity is a prerequisite. Often normalization works well in eliminating effects of signal fluctuation. Comparing Fig. 1 and Fig. 2, normalized signals and hit rates seem to vary rather little across long stretches of plates despite fluctuations in absolute signal. However, controls monitoring assay sensitivity on a plate-by-plate basis would definitely be useful here.
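
A minimal sketch of this plate-wise normalization, assuming the positions of the blank and neutral control wells are known for each plate; the control layout and data in the example are invented.

```python
import numpy as np

def normalize_plate(raw, blank_idx, neutral_idx):
    """Subtract the baseline (median of blank controls) and express all
    signals as a percentage of the neutral controls (100% = undisturbed
    assay). Assumes a linear signal-activity relationship."""
    baseline = np.median(raw[blank_idx])
    reference = np.median(raw[neutral_idx]) - baseline
    return 100.0 * (raw - baseline) / reference

# Example: a 16x24 (384-well) plate with controls in the first and last columns
plate = np.random.default_rng(1).normal(5000.0, 150.0, size=(16, 24))
rows = np.arange(16)
normalized = normalize_plate(plate,
                             (rows, np.zeros(16, dtype=int)),   # blanks
                             (rows, np.full(16, 23)))           # neutrals
```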

Figure 3: Automated pattern detection. Panel A, pattern analysis of a screening assay comprising five subsequent runs. On the left, the four typical well patterns are extracted from the normalized data set of the screen. Pattern 1 is a combination of an eight-channel pipettor pattern with a diagonal gradient from the upper right to the lower left corner of the plates and elevated signals in column 19. It is present throughout the assay (right panel, Pattern #1). The second pattern is essentially a signal gradient from right to left, predominant in run 1 and, in inverse orientation, in run 4. Patterns 3 and 4 are mainly found in run 4, which, due to the many artifacts present, is of questionable quality. Panel B, results of correcting the assay for the well patterns shown in panel A. Shown are the signal distributions of the normalized compound data (grey histogram) and of the pattern-corrected data (blue histogram). Signals are given as a percentage of controls (100% = inactive compound). Correction reduces the standard deviation of the measurement noise around the peak of "inactives" by half, as indicated by the Gaussian fits to the main peak of the distribution (dotted lines).

While normalization assures global comparability of signals between different plates, it does not guarantee that the measurements in different wells of each single plate are comparable in the presence of gradients. These signal gradients are quite frequent in HTS due to inhomogeneity of parameters such as temperature, humidity, incubation times or concentrations in different wells across a plate, factors that even the most elaborate optimization cannot completely eradicate. Typically, these gradients are quite repetitive along a stretch of plates measured in succession. This distinguishes them from real actives, which should be more or less randomly dispersed when considering a whole series of plates from reasonably randomized compound collections. Statistical deconvolution methods can identify common patterns on groups of plates (Fig. 3). These methods provide the scientist with a quick overview of strong gradients or patterns in the assay, and form the basis for the decision whether certain plates must be repeated or should simply be corrected. The gradient correction adds discriminatory power to the assay results, since it renders results perfectly comparable independent of plate location and reduces the noise introduced by gradients (Fig. 3). In the case of non-random compound distribution on the plates (e.g. in retests containing many actives), such correction methods can still be applied if there are solvent plates interspersed in the screening run at regular intervals.
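
Under the assumptions stated above (randomized library, rare actives, patterns repetitive across a run), even a per-well median across the plates of a run yields a usable estimate of the common pattern. The sketch below applies an additive correction on normalized data; the deconvolution methods referred to in the text are more elaborate, so treat this as illustrative only.

```python
import numpy as np

def correct_run_pattern(plates: np.ndarray) -> np.ndarray:
    """plates: shape (n_plates, n_rows, n_cols), normalized signals with
    100% = inactive compound. Because actives are rare and compounds are
    randomized, the per-well median across the run tracks the 'inactive'
    level and thus the repetitive gradient or pattern, which is removed."""
    pattern = np.median(plates, axis=0)   # common well pattern of the run
    return plates - (pattern - 100.0)     # additive correction
```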

Quality control, normalization and correction discussed so far all deal with phenomena that are process- and assay-related. These steps result in a set of corrected, high-quality, reliable data that may then be stored in a company's screening database. They form the basis for analyses on a higher level, such as biological activity profiling of compounds. However, compound-related problems and undesired effects may still be hidden in this data set, since these are much harder to detect. There are three kinds of these: (i) compound stability and handling problems: aging effects and fluid-handling problems occur at low frequencies, resulting e.g. in empty wells; (ii) compound influence on the assay scheme: disturbance of the readout by certain compound properties, e.g. color, scavenging properties etc.; (iii) non-specific mode-of-action: most of the “hits” found in a primary screen do not address the target for screening, but interfere elsewhere in the assay, e.g. with a key molecule in a biological signal cascade leading to the reporter gene. The challenge lies in separating these modes-of-action. To this end, cross-assay comparison and integration of other data on the scale of a company's complete compound collection is essential. This is the subject of the next sections.

2.2. STANDARDIZATION AND BIOLOGICAL PROFILING

Once the set of corrected, high-quality results of an assay is available, it may be related to the results of other assays, given that these have undergone the same rigorous quality checks. This relation yields more or less complete “activity profiles” for the compounds that had been tested in many assays over time. These profiles may further be complemented by annotations resulting from previous analyses (e.g. mode-of-action flags such as “protein synthesis inhibitor”) and by information on chemical structures and chemical compound relationships. This synoptical view of the compound set allows much more detailed conclusions on compound activity, enables a much more differentiated hit selection than a hard signal cutoff (e.g. “50% inhibition”), and provides hints on structure-activity relationships (SAR). The combination of rigorous quality control and hit selection based on complete activity profiles shifts the focus from the compounds yielding the “strongest” signals in an assay (which may be false-positives or non-specific) towards the “best” hits in terms of balanced activity, specificity, and rapid optimization.

Figure 4: Signal distributions of four selected cellular reporter gene assays, comprising single-point luminescence measurements of 100,000 compounds each. The signals were scaled using compounds with known activity on the target (0 = inactive compound). In the case of the inhibitor assays #1 and #2, the reference compounds were known antagonists; in the case of the stimulator assays #3 and #4, the reference compounds were known agonists. Despite normalization, the signal distributions differ widely in their range, width, and skewness.

The challenge lies in standardizing, structuring and condensing this complex information to keep it interpretable. At first glance, each assay differs considerably from the others due to a different target and set-up, which reflects the diversity of biology. There are efforts to counter this problem by increasingly using modular assay systems (same set-up for different targets) and screening of complete target families, which also increases process efficiency. But a priori, the comparison of data across screening assays remains difficult. As an example, Fig. 4 shows the signal distributions of four cellular assays, normalized using control compounds with known activity on the target. Two assays were designed as assays for antagonists (assays #1 and #2), two as assays for agonists (assays #3 and #4). Owing to their different signal ranges, different measurement noise, and the different skewness of their distributions, these four assays are not directly comparable on a quantitative scale.

However, biological profiling is still possible. We consider it a two-step analysis process: In a first step, the quantitative information from different assays is rescaled, if necessary, to make it comparable. To match measurements from very different assays, we first group them into compound sets of the same qualitative behavior (on an abstract "activity scale", Fig. 6) and afterwards make quantitative comparisons within each group. In a second step, the standardized measurements are placed into the biological context of the assays, since each assay measures different “effects”.
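
One straightforward way to realize such an abstract activity scale is to bin each assay's normalized signals by that assay's own measurement noise, so that assays with different ranges and widths become comparable in qualitative terms. A minimal sketch; the 3-sigma and 6-sigma cut points are illustrative assumptions.

```python
import numpy as np

def to_activity_scale(signal: np.ndarray, noise_sd: float) -> np.ndarray:
    """Map normalized signals (0 = inactive) onto a coarse scale:
    -2 strong inhibition, -1 weak inhibition, 0 inactive,
    +1 weak stimulation, +2 strong stimulation."""
    edges = np.array([-6.0, -3.0, 3.0, 6.0]) * noise_sd
    return np.digitize(signal, edges) - 2
```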

For many purposes, rescaling is not mandatory. Fig. 5 illustrates the task of finding specific inhibitors in assay #4 among compounds measured in a panel of eight assays. With a sequential approach, compounds showing low signals in assay #4 are selected first, taking global assay statistics as a guide for a relatively “soft” threshold. This yields a set of 11,514 out of 100,000 compounds. This “inhibitor set” shows a diverse pattern of activity in all the other assays. In a second step, this compound set is filtered again for low variance of signals across the other seven assays to yield the final result of 1,033 specific inhibitors in assay #4. The advantage of this approach as compared to the “classical” application of a filtering pipeline with hard thresholds is twofold: (i) the focus here is on specificity, since the first selection step collects a large set of compounds with both strong and weak, but significant, activity, knowing that this set will be refined later using specificity criteria; (ii) the approach can be made more error-tolerant, since the second (specificity) filtering step integrates over the seven “reference” assays, tolerating slight activity in a single assay out of seven to a certain extent while still classifying the compound as “specific”.
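
In code, this two-step selection might look as follows; the activity threshold corresponds to the 44% inhibition of Fig. 5, while the variance cut and the sign convention (negative = inhibition) are illustrative assumptions.

```python
import numpy as np

def specific_inhibitors(signals: np.ndarray, target: int = 3,
                        activity_cut: float = -44.0,
                        variance_cut: float = 10.0) -> np.ndarray:
    """signals: shape (n_compounds, 8), normalized so 0 = inactive.
    Step 1: a relatively soft activity threshold in the target assay
    (assay #4 is index 3, zero-based). Step 2: low spread of signals
    across the seven reference assays, i.e. the specificity filter."""
    others = [a for a in range(signals.shape[1]) if a != target]
    active = signals[:, target] <= activity_cut
    specific = signals[:, others].std(axis=1) <= variance_cut
    return np.where(active & specific)[0]
```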

Figure 5: Hit selection. Left Panel: Shown is the signal distribution in one assay (Assay #4). The scale is set so that a zero value corresponds to signals of compounds showing no activity. Selection of a set of 11,514 inhibitors showing at least 44% inhibition in this assay is indicated in brown (corresponding to a statistical significance of >90% for activity). This group of compounds is processed further. Upper Right Panel: Specificity profiles of these inhibitors in the other seven assays of the panel. Each grey line represents a compound, connecting the signals it gave in assays 1-3 and 5-8. Most of the compounds show side effects, i.e. signals significantly different from zero in at least one of the other assays. A variance filter (slider on the histogram) selects compounds of this set that exhibit only minor side-effects (red highlighted profiles). The Lower Right Panel shows the profile plot for these compounds across all assays; these compounds are specific inhibitors in Assay #4.

Another, less user-guided approach to hit selection is the clustering of bioactivity profiles. For this to work properly, the signal scales in different assays must be closely related or made comparable by transformation. An example is shown in Fig. 6, where the complete set of compounds measured in the eight-assay panel has been clustered on the basis of an activity scale. A cluster of specific inhibitors in assay #4 is identified here, too, now comprising 2,000 compounds owing to less strict conditions.
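
The paper uses a self-organized map for this clustering; as a simpler, hedged stand-in, plain k-means likewise groups compounds with similar profiles and yields one typical profile per cluster (scikit-learn is assumed as the implementation).

```python
from sklearn.cluster import KMeans

def cluster_profiles(activity, n_clusters=40):
    """activity: (n_compounds, 8) profiles on a common activity scale.
    Returns a cluster label per compound and the cluster centers, which
    play the role of the typical activity profiles of Fig. 6."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(activity)
    return labels, km.cluster_centers_
```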

Apart from the statistical assay comparability, comparability in biological terms must be addressed. Assays can be classified according to biological and technical categories 14. The former comprise target family and class, biological readout system employed, and cell line. The latter comprise detection technology and its parameters, signal level, processing sequence, incubation time, and compound concentration. These non-target-related parameters have considerable influence on the outcome of an assay, as illustrated in Fig. 7. The correlation plot in Fig. 7 sorts the eight assays into two major groups: assays #1 to #4 and assays #5 to #8, the latter group split up again into two anti-correlated sub-groups. Simply speaking, correlation is a measure of the common set of compounds showing activity in both of two assays. Not surprisingly, the splits reflect the respective set-up of the assays: common hits are frequent among assays using the same reporter cell line, albeit equipped with a different target receptor, and common hits are also a consequence of detection technology and signal level. To find the compounds that give rise to such target-unrelated ties between the assays (with the purpose of finding their target molecule), discriminant analysis and statistical tests are applied. An example is shown in Fig. 7, where compounds with inhibitory action against cell line A, but no effect on cell line B, are identified. The next step would be to ask what the biological differences between the two cell lines are and, if relevant, design tests to unravel the true mode-of-action of these compounds on a certain target in cell line A.
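
Both analyses of Fig. 7, the assay-assay correlation and the per-compound test between the two cell-line groups, reduce to a few lines; the significance cutoff below is an illustrative assumption.

```python
import numpy as np
from scipy import stats

def assay_correlation(signals: np.ndarray) -> np.ndarray:
    """signals: (n_compounds, n_assays). The Pearson correlation between
    two assay columns is high when they share a large common hit set."""
    return np.corrcoef(signals.T)

def cell_line_discriminators(signals, group_a, group_b, alpha=1e-4):
    """Welch t-test per compound between two groups of assays (e.g. the
    assays run on cell line A vs. those on cell line B); compounds with
    small p-values discriminate between the cell lines."""
    _t, p = stats.ttest_ind(signals[:, group_a], signals[:, group_b],
                            axis=1, equal_var=False)
    return np.where(p < alpha)[0]
```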

Figure 6: Clustering of 40,000 compounds according to their bioactivity in 8 assays using a self-organized map algorithm. Here, the clustering was done on an activity scale derived from the signals measured in screening. In the Upper Panel, the typical activity profile for each cluster is shown (total 40 clusters) together with the number of compounds attributed to that cluster. "A" denotes a cluster of specific inhibitors in Assay #4; "B" marks a cluster of compounds that exhibit cell-line-specific inhibition. The Lower Panel shows the profile display for all compounds; the respective cluster is color-coded.

Figure 7: Assay correlation and detection of compounds discriminating between screening cell lines. Left Panel: Correlation between the individual assays. A group of assays, denoted by "A", shows high correlation (red area), i.e. a large common "hit set". Within another group of assays, denoted by "B", assays are also correlated, either positively (red) or negatively (green). The two groups "A" and "B" correspond to screens using two different reporter cell lines. Assay #5 and assay #8 were assays for stimulators, denoted by "C". Within group "B", these are negatively correlated to the corresponding inhibitor assays #6 and #7, i.e. a stimulator in assays #5 and #8 often exhibited inhibitory action in assays #6 and #7. Right Panel: t-test between the two groups of assays based on two different cell lines. Here, 60 compounds that discriminate distinctly between the cell lines are identified (profiles colored brown).

2.3. RULE-BASED CLASSIFICATION

On the basis of biological activity profiles, the next analysis step is to build hypotheses about modes-of-action (MOA) of compounds or to classify the compounds in the screened set according to known MOAs. The goal is to obtain information at this early stage as to which “real” biological target sites the compounds might address. This in-silico approach can deal with large numbers of compounds. It is still based on experimental HTS data and simply limits subsequent biochemical experiments to those compounds that have an attractive profile and an interesting MOA hypothesis.

There are two methods of classification: it is based either on external information about the assays, assigning a logically plausible activity profile to a certain mode-of-action, or on the measurement of reference compounds of known MOA. The first approach is illustrated in Fig. 8: Here, compounds with a desired profile were sought. In assay #4, a certain G protein-coupled receptor was coupled via an inhibitory G protein and downstream signaling cascade to the reporter gene; in assay #5, with a different cell line, the coupling was via a stimulatory G protein. A receptor agonist should consequently exhibit the search profile shown: signal decrease in assay #4, increase in assay #5.
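
Such a rule is essentially a conjunction of per-assay conditions on the activity profile. A minimal sketch, with an illustrative margin and zero-based assay indices:

```python
import numpy as np

def match_search_profile(signals, down=(3,), up=(4,), margin=20.0):
    """Find compounds matching a search profile: signal decrease in the
    'down' assays (here assay #4 = index 3) and increase in the 'up'
    assays (assay #5 = index 4). signals: (n_compounds, n_assays),
    normalized so 0 = inactive; the 20% margin is an assumption."""
    hit = np.ones(signals.shape[0], dtype=bool)
    for a in down:
        hit &= signals[:, a] <= -margin
    for a in up:
        hit &= signals[:, a] >= margin
    return np.where(hit)[0]
```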

The second approach to classification requires the regular measurement of a series of reference compounds (such as a set of known drugs) in screening assays, providing reference bioactivity profiles for the individual MOAs. This data set is used to classify new compounds. If these are few, classification is done by co-clustering the test compounds together with the reference compounds. For classifying a large compound set, it is preferable to build a rule set from the reference data, either in the form of a tree classifier 18 or of a support vector machine classifier 19, and finally apply it to the complete compound set to assign the different MOAs. This classification may not be unambiguous, because usually the reference compounds do not cover all cases and their profiles may overlap to a certain extent 20. If one is interested in one defined mode-of-action only and wants to exclude others, this classification directly ranks compounds based on the proximity of their profiles to the desired MOA and forms the basis of a hit list.
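
A sketch of this reference-based classification with a support vector machine (ref. 19), assuming scikit-learn as the implementation; a tree classifier (ref. 18) would be a drop-in alternative.

```python
from sklearn.svm import SVC

def classify_moa(ref_profiles, ref_moas, all_profiles):
    """ref_profiles: (n_ref, n_assays) bioactivity profiles of reference
    compounds; ref_moas: their known mode-of-action labels;
    all_profiles: the complete compound set to classify.
    Returns predicted MOA labels and class probabilities, which can be
    used to rank compounds by proximity to a desired MOA."""
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(ref_profiles, ref_moas)
    return clf.predict(all_profiles), clf.predict_proba(all_profiles)
```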

Figure 8: Finding profile-similar compounds. Profile display showing compound signals in a panel of 8 assays. The desired activity profile is drawn in black. It demands inhibitor activity in Assay #4 and stimulator activity in Assay #5, on the basis that these assays share a similar target but its activation is inversely coupled to the readout in the two assays. 30 compounds that closely match this profile are identified (brown profiles); other compounds that exhibit the opposite profile are shown in blue.

2.4. JOINT BIOLOGICAL - CHEMICAL ANALYSIS

The ultimate goal of the analysis of HTS data is to identify the rules for the compound's chemical functionality that is recognized by the target molecule and alters its function in the desired way. Currently, this analysis of quantitative structure-activity relationships (QSAR) is normally applied in the detailed analysis of small numbers of structurally related molecules 11 for which a detailed set of biological activity data is also available. In the future, compound libraries containing many distinct sets of analogues 16 and focused library screening 21 will also provide such bioactivity data of chemically related compounds. Two of the biggest challenges for direct “SAR from HTS” currently are data errors and the “impurity” of data due to the mixed origins of the observed “activity” (e.g. a true enzyme inhibitor and a general protein-destabilizing agent may show the same “activity” in an enzyme assay, but have completely different SARs). However, if the analyses mentioned in the previous sections are applied, that is, if data is rigorously quality-checked and purity of results is achieved by biological hit profiling and by filtering out compounds of unrelated mode-of-action, HTS data will grow into the role of providing first-hand structure-activity information in high quality and unprecedented quantity. This “shortcut” to chemical optimization again has the advantages of saving work and of replacing sequential selection processes by a more holistic, multi-dimensional optimization, which in the end has the potential of providing a more efficient way to lead compounds. For example, compounds with weaker activity but an interesting profile, or compounds pointing to a promising SAR, may be chosen for follow-up investigations. Apart from these process arguments, much can be learned about the target and about the chemistry to follow or to avoid for addressing it, for fine-tuning of activity and side-activities.

Figure 9: Matching chemical clusters to bioactivity profiles. 4,100 compounds were tested in 8 assays and clustered according to their chemical structure using two-dimensional fingerprints. The tree shows the chemical clustering. The coloring of individual compounds in the enlarged section of the tree indicates the corresponding bioactivity profile, illustrated to the right. A large part of the magnified cluster consists of compounds of similar bioactivity (inhibition across several assays; yellow, purple and brown profiles). But there are also chemically closely related compounds of a different mode-of-action (e.g., CPD 14895 & CPD 14680, green profile).

An example of such an analysis is shown in Fig. 9. There, hit compounds have been clustered hierarchically according to their two-dimensional structural fingerprint and set in relation to the biological activity profiles they evoke. A graduated SAR (with respect to side-effects on related targets) exists in parts of a chosen chemical compound cluster, but other related compounds exhibit a completely different activity profile. The question is which structural difference, not captured by the clustering, separates the former from the latter: some small functional group, or an aliphatic extension at a certain position? More generally speaking, the complete HTS data set can also be used to distinguish between different approaches of constructing chemical family relationships, e.g. different descriptor sets, by looking for partial correlations between chemical compound hierarchies and biological activity patterns.
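
A sketch of such a joint analysis, assuming binary 2D fingerprints are available as a bit matrix; the Jaccard distance on binary fingerprints equals one minus the Tanimoto similarity, and the cluster cutoff is an illustrative assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def chemical_clusters(fingerprints: np.ndarray, cutoff: float = 0.4):
    """Hierarchical clustering of compounds on Tanimoto (Jaccard)
    distances between binary 2D fingerprints."""
    dist = pdist(fingerprints.astype(bool), metric="jaccard")
    return fcluster(linkage(dist, method="average"),
                    t=cutoff, criterion="distance")

def profile_coherence(profiles: np.ndarray, clusters: np.ndarray) -> dict:
    """Mean within-cluster spread of bioactivity profiles; a low value
    marks a chemical cluster that also behaves uniformly in the assays,
    i.e. a candidate for a graduated SAR."""
    return {int(c): float(profiles[clusters == c].std(axis=0).mean())
            for c in np.unique(clusters)}
```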

2.5. INTEGRATION WITH GENOMIC AND GENE EXPRESSION DATA

Screening data provides high-level, functional information on compound effects measured with high sensitivity. Activity profiles based on several assays can be used to separate different modes of action. The results may be even better understood when embedded in the genomic context of the targets and the biological assay systems. Apart from its merits in target identification 22,23, genomic information provides a classification of targets into families 24 and of assay systems into groups of, e.g., common predominant signal pathways 25, which may explain non-specific effects of compounds in certain groups of assays. This view on the underlying biology of screening experiments is particularly helpful when interpreting activity profiles or hunting for hidden targets 26.

Figure 10: Mode-of-action classification of compounds using gene expression data. The colored bar on top shows the color-coded expression profile of CHO cells treated with a compound, relative to untreated cells. Individual genes are aligned horizontally. Genes with increased expression levels upon treatment are colored red, genes with decreased expression levels green. This expression profile is mapped automatically to a series of expression profiles previously obtained by treating CHO cells with compounds of known action (a mode-of-action reference compendium). By similarity to reference profiles, the test compound is classified as a DNA synthesis inhibitor.

Currently, more in-depth investigations of compound mode-of-action and side effects are conveniently provided by analysis of cellular gene expression patterns and their modification by applied compounds 27,28. The advantage of this approach lies in its ability to detect slight changes in the metabolic state of treated cells by analyzing the expression of thousands of genes without a priori knowledge of the kind of expected effect. This method finds increasing application in tertiary screening (does the compound really act on the desired target?) and in toxicology studies (does the compound have noxious side-effects?). Fig. 10 shows an example of the application of gene expression experiments to classify a hit compound. Similar to MOA classification on the basis of screening data, reference compounds were used to build a compendium of expression profiles associated with certain MOAs and to find marker genes. On the basis of these, the screening hit is classified as a probable DNA synthesis inhibitor and, consequently, is not processed further.
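
The compendium matching of Fig. 10 can be sketched as a correlation ranking of the test profile against all reference profiles; the representation of profiles as log-ratios per gene is an assumption for illustration.

```python
import numpy as np

def classify_by_expression(profile, compendium, moa_labels):
    """profile: (n_genes,) expression log-ratios of the treated cells;
    compendium: (n_references, n_genes) reference profiles of compounds
    with known mode-of-action. Returns the best-matching MOA and its
    Pearson correlation with the test profile."""
    c = compendium - compendium.mean(axis=1, keepdims=True)
    p = profile - profile.mean()
    r = c @ p / (np.linalg.norm(c, axis=1) * np.linalg.norm(p))
    best = int(np.argmax(r))
    return moa_labels[best], float(r[best])
```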

3. CONCLUSIONS

Screening data has high potential and value that goes far beyond the individual assay. It gives information on potential modes-of-action of compounds, allows smart hit selection, and may hint at structure-activity relationships. A prerequisite for leveraging this value is a good analysis process, supported by specialized software. Screening data must be analyzed with care: thorough quality control is a must. Screening data tends to be very complex, so different levels of detail should be available, allowing researchers in different functions to quickly answer their individual questions. Intelligent analysis of screening data helps to cope with that complexity, in that (i) screening data is pre-processed in a suitable way, (ii) data is reduced and abstracted at different levels but still remains embedded in its context, and (iii) intelligent statistical algorithms assist researchers in examining and analyzing the large data sets, respecting the biological context of the data and building on the researchers' process knowledge.

REFERENCES

1. B. Cox, J.C. Denyer, A. Binnie, M.C. Donnelly, B. Evans, D.V.S. Green, J.A. Lewis, T.H. Mander, A.T. Merritt, M.J. Valler, and S.P. Watson, "Application of high-throughput screening techniques to drug discovery", Prog. Med. Chem. 37, 83 – 133, 2000.

2. R.W. Spencer, "High-throughput screening of historic collections: observations on file size, biological targets, and file diversity", Biotechnol. Bioeng. 61 (1), 61 – 67, 1998.

3. H.-J. Böhm, and G. Schneider (eds.), Virtual screening for bioactive molecules, VCH, New York, 2000.

4. D.P. Marriott, I.G. Dougall, P. Meghani, Y.J. Liu, and D.R. Flower, "Lead generation using pharmacophore mapping and three-dimensional database searching: application to muscarinic M(3) receptor antagonists", J. Med. Chem. 42, 4103 – 4112, 1999.

5. G.W.A. Milne, M.C. Nicklaus, and S. Wang, "Pharmacophores in drug design and discovery", SAR QSAR Environ. Res. 9 (1-2), 23 – 38, 1998.

6. B.K. Shoichet, and D.E. Bussiere, "Macromolecular crystallography and lead discovery: possibilities and limitations", J. Mol. Biol. 295, 337 – 356, 2000.

7. P.J. Gane, and P.M. Dean, "Recent advances in structure-based rational drug design", Curr. Opin. Struct. Biol. 10, 401 – 404, 2000.

8. L. Balbes, S. Mascarella, and D. Boyd, "A perspective of modern methods in computer-aided drug design". In K. Lipkowitz and D.B. Boyd (eds), Reviews in computational chemistry 5, 337 – 370, VCH, Weinheim, 1994.

9. R.P. Hertzberg, and A.J. Pope, "High-throughput screening: new technology for the 21st century", Curr. Opin. Chem. Biol. 4, 445 – 451, 2000.

10. M. Lutz, and T. Kenakin, Quantitative molecular pharmacology and informatics in drug discovery, John Wiley & Sons, New York, 2000.

11. P. Gedeck, and P. Willett, "Visual and computational analysis of structure-activity relationships in high-throughput screening data", Curr. Opin. Chem. Biol. 5, 389 – 395, 2001.

12. H. Gao, C. Williams, P. Labute, and J. Bajorath, "Binary quantitative structure-activity relationship (QSAR) analysis of estrogen receptor ligands", J. Chem. Inf. Comput. Sci. 39, 164 – 168, 1999.

13. M.M. Hann, A.R. Leach, and G. Harper, "Molecular complexity and its impact on the probability of finding leads for drug discovery", J. Chem. Inf. Comput. Sci. 41, 856 – 864, 2001.

14. B.R. Roberts, "Screening informatics: adding value with meta-data structures and visualization tools", Drug Disc. Today 5 (1) suppl., 10 – 14, 2000.

15. G.M. Rishton, "Reactive compounds and in vitro false positives in HTS", Drug Disc. Today 2 (9), 382 – 384, 1997.

16. E.H. Ohlstein, R.R. Ruffolo Jr., and J.D. Elliott, "Drug discovery in the next millennium", Annu. Rev. Pharmacol. Toxicol. 40, 177 – 191, 2000.

17. J.-H. Zhang, T.D.Y. Chung, and K.R. Oldenburg, "A simple statistical parameter for use in evaluation and validation of high-throughput screening assays", J. Biomol. Screen. 4 (2), 67 – 73, 1999.

18. L. Breiman, J. Friedman, R. Olshen, and C.J. Stone, Classification and Regression Trees, Chapman and Hall, New York, 1984.

19. N. Cristianini and J. Shawe-Taylor, Support Vector Machines and other kernel-based learning methods, Cambridge University Press, Cambridge, 2000.

20. N.S. Gray, L. Wodicka, A.-M.W.H. Thunnissen, T.C. Norman, S. Kwon, F.H. Espinoza, D.O. Morgan, G. Barnes, S. LeClerc, L. Meijer, S.-H. Kim, D.J. Lockhart, and P.G. Schultz, "Exploiting chemical libraries, structure, and genomics in the search for kinase inhibitors", Science 281, 533 – 538, 1998.

21. K. Illgen, T. Enderle, C. Broger, and L. Weber, "Simulated molecular evolution in a full combinatorial library", Chemistry and Biology 7 (6), 433 – 441, 2000.

22. C. Freiberg, "Novel computational methods in anti-microbial target identification", Drug Disc. Today 6 (15) suppl., S72 – S80, 2001.

23. L.J. Beeley, D.M. Duckworth, and C. Southan, "The impact of genomics on drug discovery", Prog. Med. Chem. 37, 1 – 43, 2000.

24. R.L. Tatusov, D.A. Natale, I.V. Garkavtsev, T.A. Tatusova, U.T. Shankavaram, B.S. Rao, B. Kiryutin, M.Y. Galperin, N.D. Fedorova, and E.V. Koonin, "The COG database: new developments in phylogenetic classification of proteins from complete genomes", Nucleic Acids Res. 29 (1), 22 – 28, 2001.

25. C.V. Forst, and K. Schulten, "Evolution of metabolisms: a new method for the comparison of metabolic pathways using genomics information", J. Comput. Biol. 6 (3-4), 343 – 360, 1999.

26. J.E. Staunton, D.K. Slonim, H.A. Coller, P. Tamayo, M.J. Angelo, J. Park, U. Scherf, J.K. Lee, W.O. Reinhold, J.N. Weinstein, J.P. Mesirov, E.S. Lander, and T.R. Golub, "Chemosensitivity prediction by transcriptional profiling", PNAS 98 (19), 10787 – 10792, 2001.

27. U. Scherf, D.T. Ross, M. Waltham, L.H. Smith, J.K. Lee, L. Tanabe, K.W. Kohn, W.C. Reinhold, T.G. Myers, D.T. Andrews, D.A. Scudiero, M.B. Eisen, E.A. Sausville, Y. Pommier, D. Botstein, P.O. Brown, and J.N. Weinstein, "A gene expression database for the molecular pharmacology of cancer", Nature Genet. 24 (3), 236 – 244, 2000.

28. T. Ideker, V. Thorsson, J.A. Ranish, R. Christmas, J. Buhler, J.K. Eng, R. Bumgarner, D.R. Goodlett, R. Aebersold, and L. Hood, "Integrated genomic and proteomic analyses of a systematically perturbed metabolic network", Science 292 (5518), 929 – 934, 2001.