PubChem BioAssay: 2014 update

Yanli Wang*, Tugba Suzek, Jian Zhang, Jiyao Wang, Siqian He, Tiejun Cheng, Benjamin A. Shoemaker, Asta Gindulyte and Stephen H. Bryant*

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Received September 12, 2013; Revised September 30, 2013; Accepted October 1, 2013

Nucleic Acids Research, 2013, 1–8, doi:10.1093/nar/gkt978. Advance Access published November 5, 2013. Published by Oxford University Press 2013. This work is written by US Government employees and is in the public domain in the US.

*To whom correspondence should be addressed. Tel: +1 301 435 7811; Fax: +1 301 435 7793; Email: ywang@ncbi.nlm.nih.gov. Correspondence may also be addressed to Stephen H. Bryant. Tel: +1 301 435 7792; Fax: +1 301 435 7793; Email: bryant@ncbi.nlm.nih.gov

ABSTRACT

PubChem’s BioAssay database (http://pubchem.ncbi.nlm.nih.gov) is a public repository for archiving biological tests of small molecules generated through high-throughput screening experiments, medicinal chemistry studies, chemical biology research and drug discovery programs. In addition, the BioAssay database contains data from high-throughput RNA interference screening aimed at identifying critical genes responsible for a biological process or disease condition. The mission of PubChem is to serve the community by providing free and easy access to all deposited data. To this end, PubChem BioAssay is integrated into the National Center for Biotechnology Information retrieval system, making its records searchable by Entrez queries and cross-linked to other biomedical information archived at the National Center for Biotechnology Information. Moreover, PubChem BioAssay provides web-based and programmatic tools allowing users to search, access and analyze bioassay test results and metadata. In this work, we provide an update for the PubChem BioAssay resource, covering information content growth, new developments supporting data integration and search, and the recently deployed PubChem Upload to streamline chemical structure and bioassay submissions.

INTRODUCTION

The PubChem BioAssay database (http://pubchem.ncbi.nlm.nih.gov) (1–4) is a public repository for biological activity data of small molecules and RNAi reagents, hosted since 2004 by the National Center for Biotechnology Information (NCBI) (5), a division of the National Library of Medicine under the National Institutes of Health. BioAssay test results are linked to the chemical structures of tested small molecules and the sequencing data of screened RNA interference (RNAi) reagents as available. In addition, the information content in the BioAssay database is linked to several biomedical and literature databases hosted at NCBI, including PubMed, Protein, Gene, Nucleotide, BioSystems, Taxonomy, OMIM and protein 3D structure associated with bioassay targets. PubChem is committed to offering biomedical researchers free access to this information. BioAssay data can be searched, accessed and analyzed by Entrez queries as well as via a suite of web-based and programmatic tools provided by PubChem, making PubChem a widely used public information system for accelerating chemical biology research and drug development. Table 1 provides a summary of BioAssay services and the corresponding URLs. Most of the web-based services can also be accessed at http://pubchem.ncbi.nlm.nih.gov/assay.

Developing and managing a public archive system for complex bioassay data has been both challenging and rewarding. In the past 9 years, PubChem has come a long way to manage the rapidly growing data and meet the increasing demand from the community. PubChem has become a leading public bioassay data repository by (i) supporting broad types of bioactivity information with an optimized bioassay data standard, (ii) maintaining steady enhancement of database infrastructure and scalability, (iii) providing and enhancing a streamlined data upload system, (iv) integrating with other biomedical information resources and (v) expanding and empowering search, retrieval, analysis and download tools. In this work, we provide an update on several aspects of the information resource, including data content growth, database infrastructure consolidation, new search indices, project-based bioassay links and newly developed web services, including target-based bioactivity data tools and the recently deployed PubChem Upload system.

BioAssay DATA CONTENT GROWTH

The BioAssay database has been growing substantially during the past years (Figure 1). As of 1 September 2013, the BioAssay database has received >700 000
depositions of bioassays (Figure 1A). Counting solely the latest version of each bioassay record by accession (i.e. AID), the database contains 200 000 000 bioactivity outcome summaries (Figure 1B) and 1 200 000 000 data points representing biological properties for 2 800 000 small molecule samples, 1 900 000 chemical structures and 108 000 RNAi reagents (Figure 1C). This information represents tens of thousands of potential modulators for >8000 protein targets and 30 000 genes critical for biological processes, hence providing rich information on chemical and RNAi tools for chemical and molecular biology research.

Table 1. A list of PubChem BioAssay services

Service | Description | URL example
BioAssay service home | Access a list of BioAssay services | http://pubchem.ncbi.nlm.nih.gov/assay
BioAssay search | Search BioAssay database with Entrez | http://www.ncbi.nlm.nih.gov/pcassay
BioAssay search advanced page | An interface for searching multiple search fields | http://www.ncbi.nlm.nih.gov/pcassay/limits
BioAssay text search advanced page | An interface for reviewing search history and refining search results with Boolean operation | http://www.ncbi.nlm.nih.gov/pcassay/advanced
BioAssay summary | Access and download a bioassay record | http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=myAID
BioAssay data retrieval tool | Retrieve a full data table or an active subset from a single bioassay record | http://pubchem.ncbi.nlm.nih.gov/assay/assaydata.html?aid=myAID; http://pubchem.ncbi.nlm.nih.gov/assay/assaydata.html?act=act&aid=myAID
BioAssay data selection tool | Select a user-defined data subset from a single bioassay record | http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?q=t&aid=myAID
Bioactivity data tool | Retrieve multiple-assay bioactivity data for a single substance sample (SID), chemical structure (CID), protein target (GI) or gene target (GeneID) | http://pubchem.ncbi.nlm.nih.gov/assay.cgi?sid=mySID; http://pubchem.ncbi.nlm.nih.gov/assay.cgi?sid=myCID; http://pubchem.ncbi.nlm.nih.gov/assay.cgi?sid=myGI; http://pubchem.ncbi.nlm.nih.gov/assay.cgi?sid=myGeneID
BioActivity summary (compound-centric) | Summarize and analyze bioactivity data for a set of records presented from the compound point of view | http://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.cgi?tab=1
BioActivity summary (assay-centric) | Summarize and analyze bioactivity data for a set of records presented from the assay point of view | http://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.cgi?tab=2
BioActivity summary (target-centric) | Summarize and analyze bioactivity data for a set of records presented from the target point of view | http://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.cgi?tab=3
Structure-activity relationship analysis (SAR) | Analyze and visualize structure-activity relationship with clustering tools and a heatmap-style display | http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?p=heat
Scatter plot/histogram | Analyze bioassay test results with histogram or scatter plot | http://pubchem.ncbi.nlm.nih.gov/assay/plot.cgi?plottype=2
Dose-response curve tool | Analyze bioassay test results and visualize dose-response curve | http://pubchem.ncbi.nlm.nih.gov/assay/plot.cgi?plottype=1
Related BioAssay | Summarize bioassay relationship by overlap of active compounds, target sequence similarity, deposited annotation, same publication, common pathways and same assay project | http://pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi
PubChem PUG/SOAP | PubChem programmatic tool for data retrieval | http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html
PUG/REST | PubChem REST API for data retrieval | http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html
BioAssay download tool | A flexible download interface | http://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi
BioAssay FTP | FTP for all PubChem BioAssay records and related information | ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay
BioAssay data standard | XML data specification for PubChem BioAssay data model | ftp://ftp.ncbi.nlm.nih.gov/pubchem/data_spec
PubChem Upload | Substance and bioassay submission system | http://pubchem.ncbi.nlm.nih.gov/upload
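The URL examples in Table 1 use placeholders such as myAID. As a minimal sketch of how such a placeholder might be filled in from a script, the Python snippet below composes the BioAssay summary and data retrieval URLs for a given accession. The helper names are hypothetical and not part of any PubChem library; the URL forms are assumed to be those listed in the table.

```python
from urllib.parse import urlencode

# Base path taken from Table 1; the function names are hypothetical helpers.
ASSAY_BASE = "http://pubchem.ncbi.nlm.nih.gov/assay"


def summary_url(aid: int) -> str:
    """URL of the BioAssay summary page for one accession (AID)."""
    return f"{ASSAY_BASE}/assay.cgi?{urlencode({'aid': aid})}"


def data_table_url(aid: int, active_only: bool = False) -> str:
    """URL of the data retrieval tool, optionally limited to the active subset."""
    params = {"act": "act", "aid": aid} if active_only else {"aid": aid}
    return f"{ASSAY_BASE}/assaydata.html?{urlencode(params)}"


print(summary_url(651839))
print(data_table_url(651839, active_only=True))
```

Using `urlencode` rather than string concatenation keeps the query string correctly escaped if other parameters are added later.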


The content in the PubChem BioAssay database is contributed by >50 organizations worldwide, including US government-funded institutions, pharmaceutical companies, research laboratories and collaborators hosting chemical biology databases. A summary of bioassay vendors and submission counts is provided at http://pubchem.ncbi.nlm.nih.gov/sources/#assay. BioAssay datasets added during the past 2 years include (i) small molecule data from screening centers of the NIH Molecular Libraries and Imaging Program [Molecular Library Program (MLP)] (http://commonfund.nih.gov/molecularlibraries), the ICCB-Longwood/NSRB Screening Facility at the Harvard Medical School (http://iccb.med.harvard.edu), EPA Tox21 (http://epa.gov/ncct/Tox21) and the Milwaukee Institute for Drug Discovery (http://www4.uwm.edu/drugdiscovery); (ii) curated dataset records from the Meiler Lab at Vanderbilt University, which derive the ultimate bioactivity outcome of a small molecule by combining multiple bioassay results in PubChem to facilitate cheminformatics studies (6); (iii) curated datasets from literature extraction by IUPHAR-DB (7) and ChEMBL (8); and (iv) small interfering RNA (siRNA) data from the Drosophila RNAi Screening Center, the ICCB-Longwood/NSRB Screening Facility at the Harvard Medical School (http://iccb.med.harvard.edu), the Cancer Research UK Cambridge Research Institute, the Department of Molecular Cell Biology at the Weizmann Institute of Science, the Institut National de la Sante et de la Recherche Medicale (INSERM), the Peterson Lab at Genentech and the ten Dijke Lab at Leiden University Medical Center. Many of these newly added siRNA datasets are associated with recent publications in journals such as Nature Cell Biology (9–11), Genome Research (12), J. Virol. (13), Cancer Research (14), PNAS (15,16), Nature (17–19), Science (20,21) and Nature Genetics (22). Each of these bioassay records is linked to the corresponding abstract in PubMed, allowing PubChem users to track down the publication easily. Conversely, PubMed users gain access to the corresponding bioassay datasets through this cross-link.

PubChem continues to mirror the ChEMBL database (8) hosted at the European Bioinformatics Institute. Multiple ChEMBL releases and database changes over the past 2 years have been incorporated into PubChem. Recently added annotations at ChEMBL are recorded via the Categorized Comment field of the PubChem BioAssay data model (1). Binding, surface, ligand and lipophilic ligand efficiency indices are added to a bioassay record as additional test results. As a result, many of the bioassay records in PubChem have gone through multiple updates. Annotation for bioactivity outcome (e.g. active or inactive) is largely missing in the ChEMBL datasets, hindering their integration with the rest of PubChem data and analysis tools. In such a case, PubChem now assigns a bioactivity outcome using a 50 µM cutoff based on readouts such as IC50, EC50 or Ki, allowing a larger portion of the ChEMBL data to be blended into the PubChem system.

Figure 1. Growth in PubChem BioAssay: (A) records, (B) bioactivity outcomes (counted by AID–SID pair) and (C) unique tested samples.
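The cutoff-based outcome assignment for ChEMBL-derived records can be illustrated with a short Python sketch. It assumes that potency values (IC50, EC50 or Ki) are already expressed in micromolar, that values at or below the cutoff count as active, and that results without a potency value are left unspecified. The function name, the boundary handling and the outcome labels are illustrative assumptions, not PubChem's actual implementation.

```python
from typing import Optional

CUTOFF_UM = 50.0  # cutoff described in the text, in micromolar


def assign_outcome(potency_um: Optional[float], cutoff_um: float = CUTOFF_UM) -> str:
    """Label a result from its IC50/EC50/Ki value (micromolar).

    Assumption: potency at or below the cutoff is 'active', above it is
    'inactive', and a missing value stays 'unspecified'.
    """
    if potency_um is None:
        return "unspecified"
    return "active" if potency_um <= cutoff_um else "inactive"


for value in (0.8, 50.0, 120.0, None):
    print(value, assign_outcome(value))
```

Keeping the cutoff as a parameter makes it easy to see how sensitive the active set is to the chosen threshold.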


DATABASE INFRASTRUCTURE ENHANCEMENT

A robust and scalable database system is crucial to support the rapid growth of PubChem BioAssay. A set of relational databases and tables is designed and set up on Microsoft SQL servers to (i) accept bioassay submissions from depositors, (ii) archive bioassay updates with version control, (iii) track embargo status, (iv) record and derive links and relationships among bioassays and other biomedical information, (v) provide search indexes, (vi) support fast data retrieval and analysis and (vii) facilitate daily updates at the FTP site. Challenged by the accelerated growth of bioassay data content, great efforts have been invested in the past years to enhance the database infrastructure capacity through both hardware upgrades and revised database design. As a result, new services have been added to the PubChem resource. Furthermore, the performance of bioassay data retrieval and download services has been significantly improved, largely eliminating the need for a queuing system and minimizing user wait time.

DATA INTEGRATION AND NEW WEB SERVICES

The PubChem BioAssay database is fully integrated with other biomedical databases hosted by NCBI and provides a suite of web-based and programmatic tools to support data access, retrieval, analysis and download from PubChem or cross-linked databases (Table 1). Several new services for integrating bioassay target and bioactivity data, or for grouping bioassays based on an assay project, are described below. Other developments that have focused on behind-the-scenes enhancement of data retrieval without significant web interface change are not summarized in this work.

Rapid access of bioactivity data for a protein or gene target

PubChem BioAssay closes the gap between molecular and chemical biology research by presenting and linking up information on both chemical and RNAi tools in one system, supporting the study of gene function and biological pathways. The majority of small molecule screening data in PubChem are associated with protein targets, while RNAi screening data link each tested reagent to a gene. PubChem provides multiple mechanisms for cross-referencing protein and gene targets from bioactivity data (1). As a result, a protein or gene may link to many bioactivity datasets. It is critical to provide rapid access to such multi-assay bioactivity data for these protein and gene targets. Such access also serves as a unique annotation for the corresponding Entrez Protein or Gene record, leading users to experimental data from chemical biology and RNAi research and enhancing the discoverability of the NCBI Entrez system. Toward this end, two new services, the Protein Target Bioactivity Data Tool and the Gene Target Bioactivity Data Tool, were developed to access the associated bioactivity information in PubChem.

From a protein target record such as G-protein-coupled receptor (GPCR) 35 (http://www.ncbi.nlm.nih.gov/protein/NP_005292.2), bioactivity data for this protein target can be accessed by the link ‘BioAssay by Target (Summary)’. As shown in Figure 2A, this Protein Target Bioactivity Data Tool draws and identifies each tested substance, together with its bioactivity results, assay title and a link to detailed data such as dose-response curves. By default, the data table is sorted by bioactivity outcome and potency of the substances, so that active results and potent reagents appear first. Graphical filters are provided at the top of the page, allowing one to drill down to a data subset of interest. For example, this GPCR protein has a ‘Probe’ filter, highlighting three chemical probes discovered by a high-throughput screening (HTS) project for selective GPR35 antagonists.

The bioactivity data for the relevant gene target record (http://www.ncbi.nlm.nih.gov/gene/2859) can be accessed by the link ‘BioAssay by Target (Summary)’. With this Gene Target Bioactivity Data Tool, a similar summary of relevant bioassay activity results is displayed, as shown in Figure 2B. Note that by using a gene identifier in this case, additional data are retrieved, including RNAi test results (as indicated by the filter ‘RNAi’ shown under ‘Substance Types’), which indicate that GPR35 functions as a cellular gene repressing HPV18 LCR, as identified by a genome-wide siRNA screen. This example illustrates the power of aggregating bioactivity data across datasets onto a unified display. The Gene Target Bioactivity Data Tool is particularly useful for accessing datasets from multiple depositors and literature-based data from many journal articles. Moreover, it links simultaneously to findings in chemical biology research and RNAi screenings, enabling users to evaluate the biological role of a gene and to identify its small molecule regulators using data shown on the same display.
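The default ordering of the target-centric data table (active results first, then the most potent substances) can be mimicked on downloaded data. The Python sketch below uses made-up records and hypothetical field names. Placing unspecified outcomes ahead of inactive ones and rows without a potency value last are assumptions of this example, so it illustrates only the sort order and not the tool's internals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    sid: int                     # substance identifier (illustrative values)
    outcome: str                 # "active", "inactive" or "unspecified"
    potency_um: Optional[float]  # potency in micromolar, if reported


def sort_for_display(results: List[Result]) -> List[Result]:
    """Actives first; within each outcome, lower (more potent) values first."""
    rank = {"active": 0, "unspecified": 1, "inactive": 2}

    def key(r: Result):
        missing = r.potency_um is None
        # Rows without a potency value sort after rows that have one.
        return (rank.get(r.outcome, 3), missing, 0.0 if missing else r.potency_um)

    return sorted(results, key=key)


rows = [
    Result(101, "inactive", None),
    Result(102, "active", 5.2),
    Result(103, "active", 0.02),
    Result(104, "active", None),
]
print([r.sid for r in sort_for_display(rows)])
```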

BioAssays associated with the same assay project

PubChem tracks the relationships among bioassay records as indicated by submitters. PubChem has also developed several computational methods for identifying additional bioassay linkages based on target sequence similarity, common active compounds and biological pathways, as well as datasets abstracted from the same publication (1). To better support decision making, PubChem now clusters and links up bioassays based on assay projects. This feature aims to make better use of data deposited by a network such as the NIH MLP and the Tox21 program. MLP-funded screening laboratories are required to deposit data progressively into PubChem as an assay project continues. It usually takes months or years to finish an assay project aimed at developing a chemical probe; hence, multiple bioassay datasets are often submitted to PubChem for the same project but under distinct accessions (AIDs). These datasets are closely related, often covering a primary HTS result, follow-ups with dose-response and toxicity testing, or counter screenings against biologically related targets, different cell lines or using different assay methods. PubChem allows submitters to specify such relationships via the cross-reference (XRef) data field. However, it is up to the submitters to provide all links as new data are made available. As a result, cross-references to related bioassay datasets may unfortunately be lacking or incomplete among many datasets, making it difficult for users to discover these key associations.

Figure 2. Bioactivity data for a (A) protein target and (B) gene target.

To improve this situation, it is now common practice to create a ‘Summary’ bioassay at the outset of a multi-assay project and then link each subsequent related assay back to that summary record. This means that the submitter only needs to specify a single link from each bioassay record to the same summary, and all other links between related assays are generated automatically. As a result, assay projects are indexed on top of the individual records. Users visiting any bioassay record can access all relevant datasets of the same project without the need for the submitter to specify all connections. As shown in Figure 3, the links to these related bioassays are labeled in the BioAssay Summary service as ‘Same Project’ under the ‘Related BioAssays’ section. The Modulation of the Metabotropic Glutamate Receptor mGluR3 (GRM3) assay (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=651839) indicates only one ‘Depositor Specified’ assay, whereas eight bioassay records were identified as related to the same project by the new procedure. One may see details of the related bioassays by clicking the link ‘Same Project’.
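The indexing idea can be sketched in a few lines of Python: each assay records one link to its summary AID, and the ‘Same Project’ neighbors of any assay are derived by grouping on that link. The AIDs and the dictionary layout below are invented for illustration and do not reflect PubChem's database schema.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical input: assay AID -> summary AID declared by the submitter.
summary_of: Dict[int, int] = {
    1001: 1000,  # primary screen
    1002: 1000,  # dose-response follow-up
    1003: 1000,  # counter screen
    2001: 2000,  # assay from an unrelated project
}


def same_project(summary_of: Dict[int, int]) -> Dict[int, List[int]]:
    """Derive, for every assay, the other assays sharing its summary record."""
    members = defaultdict(list)
    for aid, summary in summary_of.items():
        members[summary].append(aid)
    return {
        aid: sorted(other for other in members[summary] if other != aid)
        for aid, summary in summary_of.items()
    }


links = same_project(summary_of)
print(links[1001])
```

With n assays in a project, the submitter supplies n links to the summary, while the n(n-1) pairwise relationships are derived rather than entered by hand.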

PUBLIC ACCESS

BioAssay record and BioAssay summary service

A PubChem BioAssay record can be accessed via the BioAssay Summary service at http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=myAID, where myAID is a valid BioAssay accession (AID). As shown in Figure 3 for the GRM3 assay (AID 651839), the BioAssay Summary service provides (i) full access to submitted information, including bioassay protocol, descriptions, assay data and cross-references; (ii) derived bioassay relationships; and (iii) tools for evaluating tested compounds, studying SAR or researching the target. In the ‘Target’ section, a link ‘More Bioactivity data’ has recently been added to gather all bioactivity data in PubChem associated with the GRM3 target. With the improved database infrastructure, the BioAssay Summary service now provides instant access to the bioassay data table and an enhanced data download function. With the recently launched PubChem social media outreach, links to social media accounts are now provided on this page.

Figure 3. BioAssay Summary page for bioassay record AID 651839. New and enhanced features are highlighted, including fast download, instant access to the data table, a link to additional bioactivity data targeting GRM3, a link to related bioassays on the same project and links to social media accounts.


BioAssay search

Keyword search in the PubChem BioAssay database is supported by NCBI Entrez at http://www.ncbi.nlm.nih.gov/pcassay. Textual information in PubChem BioAssay is indexed under numerous fields. An advanced interface is provided at http://www.ncbi.nlm.nih.gov/pcassay/limits (Limits page) to access multiple indices and filters (1). Based on information provided in categorized comment fields and keywords in the title of a bioassay record, new filters were added to support the identification of records containing (i) biochemical assay, (ii) cell-based assay, (iii) protein–protein interaction bioactivity and (iv) in vivo or in vitro assay. A newly added menu, ‘Assay Project’, can be used to select an assay project and access related datasets. ChEMBL depositor information is also indexed to support sub-setting of ChEMBL records. As a result, although http://www.ncbi.nlm.nih.gov/pcassay?term=ChEMBL[sourcename] retrieves all ChEMBL bioassays in PubChem, http://www.ncbi.nlm.nih.gov/pcassay?term=%22ChEMBL%3A%3AScientific+Literature%22%5BSourceName%5D[SourceName] retrieves literature-based records from ChEMBL, and http://www.ncbi.nlm.nih.gov/pcassay?term=%22ChEMBL%3A%3ASt+Jude+Malaria+Screening%22%5BSourceName%5D[SourceName] retrieves ChEMBL records deposited by St Jude Malaria Screening.
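The percent-encoded search URLs above are easier to produce programmatically than by hand. As a minimal sketch, the Python snippet below encodes a fielded Entrez term with the standard library; the helper name is hypothetical, and the query string `"ChEMBL::Scientific Literature"[SourceName]` is assumed to be the decoded form of the literature-based example.

```python
from urllib.parse import urlencode

PCASSAY_SEARCH = "http://www.ncbi.nlm.nih.gov/pcassay"


def pcassay_search_url(term: str) -> str:
    """Build a BioAssay Entrez search URL with a percent-encoded query term."""
    return f"{PCASSAY_SEARCH}?{urlencode({'term': term})}"


print(pcassay_search_url('"ChEMBL::Scientific Literature"[SourceName]'))
```

`urlencode` escapes the quotes, colons and brackets and turns spaces into `+`, which is the encoding visible in the example URLs.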

PubChem BioAssay FTP AND DOWNLOAD

PubChem provides multiple services for users to download bioassay records, which have been described previously (1). These primarily include (i) an enhanced download function at the Summary service (shown in Figure 3); (ii) a web-based BioAssay download service at http://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi, with a flexible interface supporting full or partial data download by specifying bioassay accessions (AIDs) and tested substance accessions (SIDs); and (iii) the daily updated PubChem BioAssay FTP at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay, providing open access to all bioassay datasets. While the primary FTP structure remains the same, one new FTP directory, ‘Extras’, has been added to offer additional information about the BioAssay resource. In this folder, the file ‘Cid2BioactivityLink’ provides a list of tested compounds and the corresponding URLs linking to associated bioactivity data. Similarly, the ‘Gi2BioactivityLink’ and ‘Geneid2BioactivityLink’ files provide the lists of corresponding bioactivity data links for protein and gene targets, respectively. The ‘Aid2GiGeneid’ file contains all the bioassay (AID), protein target (GI) and gene target (Gene ID) associations in the BioAssay database. Also, a file for assay project-based related bioassays has been added to the directory at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/AssayNeighbors. Column headers for the comma-separated values (CSV) format have been modified to provide consistency among multiple download methods (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/CSV/README). Readout names are now provided in CSV files to ease data parsing and interpretation. In addition, the PubChem PUG/SOAP (http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html) and PUG/REST (http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html) facilities are being developed to support programmatic retrieval of bioassay information.
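Once a CSV data table has been downloaded, the named readout columns can be processed with standard tooling. The following Python sketch filters the active rows from a small inline sample. The column names (PUBCHEM_SID, PUBCHEM_ACTIVITY_OUTCOME, IC50_uM), the textual outcome labels and the sample values are assumptions made for illustration, so the README on the FTP site should be consulted for the actual layout.

```python
import csv
import io

# Inline stand-in for a downloaded file; headers and values are illustrative.
SAMPLE = """PUBCHEM_SID,PUBCHEM_ACTIVITY_OUTCOME,IC50_uM
11111,active,1.5
22222,inactive,
33333,active,12.0
"""


def active_rows(handle):
    """Yield (SID, IC50) pairs for rows whose outcome column reads 'active'."""
    for row in csv.DictReader(handle):
        if row["PUBCHEM_ACTIVITY_OUTCOME"] == "active":
            yield int(row["PUBCHEM_SID"]), float(row["IC50_uM"])


print(list(active_rows(io.StringIO(SAMPLE))))
```

Reading by header name rather than by column position is what the readout names in the CSV files make possible, and it keeps a script working if columns are reordered.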

PubChem UPLOAD FOR BioAssay SUBMISSION

As a public repository handling diverse and vast amounts of chemical structure and bioassay data, it is critical for PubChem to provide an efficient and user-friendly way to upload data. The recently released PubChem Upload (http://pubchem.ncbi.nlm.nih.gov/upload) makes use of advances in web technologies to offer streamlined support for data submissions and updates to the Substance and BioAssay databases. PubChem Upload supports all functionalities and data exchange formats of its predecessor (1). Furthermore, it provides an extensive set of wizards, inline help tips and tutorials for guiding submitters to enter assay data and descriptive information. More specifically, the new assay submission capabilities offered by PubChem Upload include (i) bioassay submission wizards to assist novice users with both small molecule and RNAi screenings; (ii) improved user interface response to complex input with newer web technology; (iii) simplified new user registration and upgrades for production user accounts; (iv) improved help, including hints built into the user interface and a tutorial; (v) extensive PubChem bioassay templates for new submissions or for record updates; (vi) full editing and integration of assay data and description tables; and (vii) expanded import/export handling of spreadsheets for assays. A detailed help document, tutorial and sample submission templates for PubChem Upload are available at http://pubchem.ncbi.nlm.nih.gov/upload/docs/upload_help.html, http://pubchem.ncbi.nlm.nih.gov/upload/tutorial and http://pubchem.ncbi.nlm.nih.gov/upload/docs/upload_help.html#AssaySubmission, respectively. A detailed description of PubChem Upload will be provided in a separate article.

SUMMARY

PubChem is committed to serving as a public repository for bioactivity data of small molecules and RNAi. PubChem also provides an integrated information platform with a suite of tools allowing users to query, analyze and download all database content. PubChem will continue to improve services and tools as technology advances and to further integrate the information it contains with third-party annotations and other public biomedical data. With the support of open access to the data and the delivery of the new Upload system, PubChem welcomes the community to use the resource and to contribute data content to the repository.

ACKNOWLEDGEMENTS

The authors thank all submitters who have contributed data to PubChem and the rest of the PubChem team for their support.


FUNDING

The NIH Intramural Research program. Funding for open access charge: National Institutes of Health, USA.

Conflict of interest statement. None declared.

REFERENCES

1. Wang, Y., Xiao, J., Suzek, T.O., Zhang, J., Wang, J., Zhou, Z., Han, L., Karapetyan, K., Dracheva, S., Shoemaker, B.A. et al. (2012) PubChem’s BioAssay database. Nucleic Acids Res., 40, D400–D412.

2. Wang, Y., Bolton, E., Dracheva, S., Karapetyan, K., Shoemaker, B.A., Suzek, T.O., Wang, J., Xiao, J., Zhang, J. and Bryant, S.H. (2010) An overview of the PubChem BioAssay resource. Nucleic Acids Res., 38, D255–D266.

3. Wang, Y., Xiao, J., Suzek, T.O., Zhang, J., Wang, J. and Bryant, S.H. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res., 37, W623–W633.

4. Bolton, E.E., Wang, Y., Thiessen, P.A. and Bryant, S.H. (2008) PubChem: integrated platform of small molecules and biological activities. Annu. Rep. Comput. Chem., 4, 217–241.

5. Sayers, E.W., Barrett, T., Benson, D.A., Bolton, E., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Federhen, S. et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 39, D38–D51.

6. Butkiewicz, M., Lowe, E.W. Jr, Mueller, R., Mendenhall, J.L., Teixeira, P.L., Weaver, C.D. and Meiler, J. (2013) Benchmarking ligand-based virtual high-throughput screening with the PubChem database. Molecules, 18, 735–756.

7. Sharman, J.L., Benson, H.E., Pawson, A.J., Lukito, V., Mpamhanga, C.P., Bombail, V., Davenport, A.P., Peters, J.A., Spedding, M. and Harmar, A.J. (2013) IUPHAR-DB: updated database content and new features. Nucleic Acids Res., 41, D1083–D1088.

8. Gaulton, A., Bellis, L.J., Bento, A.P., Chambers, J., Davies, M., Hersey, A., Light, Y., McGlinchey, S., Michalovich, D., Al-Lazikani, B. et al. (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res., 40, D1100–D1107.

9. Mulder, K.W., Wang, X., Escriu, C., Ito, Y., Schwarz, R.F., Gillis, J., Sirokmany, G., Donati, G., Uribe-Lewis, S., Pavlidis, P. et al. (2012) Diverse epigenetic strategies interact to control epidermal differentiation. Nat. Cell Biol., 14, 753–763.

10. Chih, B., Liu, P., Chinn, Y., Chalouni, C., Komuves, L.G., Hass, P.E., Sandoval, W. and Peterson, A.S. (2012) A ciliopathy complex at the transition zone protects the cilia as a privileged membrane domain. Nat. Cell Biol., 14, 61–72.

11. Prager-Khoutorsky, M., Lichtenstein, A., Krishnan, R., Rajendran, K., Mayo, A., Kam, Z., Geiger, B. and Bershadsky, A.D. (2011) Fibroblast polarization is a matrix-rigidity-dependent process controlled by focal adhesion mechanosensing. Nat. Cell Biol., 13, 1457–1465.

12. Imberg-Kazdan, K., Ha, S., Greenfield, A., Poultney, C.S., Bonneau, R., Logan, S.K. and Garabedian, M.J. (2013) A genome-wide RNA interference screen identifies new regulators of androgen receptor function in prostate cancer cells. Genome Res., 23, 581–591.

13. Powell, M.L., Smith, J.A., Sowa, M.E., Harper, J.W., Iftner, T., Stubenrauch, F. and Howley, P.M. (2010) NCoR1 mediates papillomavirus E8^E2C transcriptional repression. J. Virol., 84, 4451–4460.

14. Galluzzi, L., Morselli, E., Vitale, I., Kepp, O., Senovilla, L., Criollo, A., Servant, N., Paccard, C., Hupe, P., Robert, T. et al. (2010) miR-181a and miR-630 regulate cisplatin-induced cancer cell death. Cancer Res., 70, 1793–1803.

15. Smith, J.A., White, E.A., Sowa, M.E., Powell, M.L., Ottinger, M., Harper, J.W. and Howley, P.M. (2010) Genome-wide siRNA screen identifies SMCX, EP400, and Brd4 as E2-dependent regulators of human papillomavirus oncogene expression. Proc. Natl Acad. Sci. USA, 107, 3752–3757.

16. Zhang, S.L., Yeromin, A.V., Zhang, X.H., Yu, Y., Safrina, O., Penna, A., Roos, J., Stauderman, K.A. and Cahalan, M.D. (2006) Genome-wide RNAi screen of Ca(2+) influx identifies genes that regulate Ca(2+) release-activated Ca(2+) channel activity. Proc. Natl Acad. Sci. USA, 103, 9357–9362.

17. Friedman, A. and Perrimon, N. (2006) A functional RNAi screen for regulators of receptor tyrosine kinase and ERK signalling. Nature, 444, 230–234.

18. Gwack, Y., Sharma, S., Nardone, J., Tanasa, B., Iuga, A., Srikanth, S., Okamura, H., Bolton, D., Feske, S., Hogan, P.G. et al. (2006) A genome-wide Drosophila RNAi screen identifies DYRK-family kinases as regulators of NFAT. Nature, 441, 646–650.

19. Bard, F., Casano, L., Mallabiabarrena, A., Wallace, E., Saito, K., Kitayama, H., Guizzunti, G., Hu, Y., Wendler, F., Dasgupta, R. et al. (2006) Functional genomics reveals genes involved in protein secretion and Golgi organization. Nature, 439, 604–607.

20. Vig, M., Peinelt, C., Beck, A., Koomoa, D.L., Rabah, D., Koblan-Huberson, M., Kraft, S., Turner, H., Fleig, A., Penner, R. et al. (2006) CRACM1 is a plasma membrane protein essential for store-operated Ca2+ entry. Science, 312, 1220–1223.

21. DasGupta, R., Kaykas, A., Moon, R.T. and Perrimon, N. (2005) Functional genomic analysis of the Wnt-wingless signaling pathway. Science, 308, 826–833.

22. Nybakken, K., Vokes, S.A., Lin, T.Y., McMahon, A.P. and Perrimon, N. (2005) A genome-wide RNA interference screen in Drosophila melanogaster cells for new components of the Hh signaling pathway. Nat. Genet., 37, 1323–1332.


depositions of bioassays (Figure 1A). Counting solely the latest version of each bioassay record by accession (i.e. AID), the database contains 200 000 000 bioactivity outcome summaries (Figure 1B) and 1 200 000 000 data points representing biological properties for 2 800 000 small molecule samples, 1 900 000 chemical structures and 108 000 RNAi reagents (Figure 1C). This information represents tens of thousands of potential modulators for >8000 protein targets and 30 000 genes critical for biological process, hence providing rich information on chemical and RNAi tools for chemical and molecular biology research.

Table 1. A list of PubChem BioAssay services

Service | Description | URL example
BioAssay service home | Access a list of BioAssay services | http://pubchem.ncbi.nlm.nih.gov/assay
BioAssay search | Search BioAssay database with Entrez | http://www.ncbi.nlm.nih.gov/pcassay
BioAssay search advanced page | An interface for searching multiple search fields | http://www.ncbi.nlm.nih.gov/pcassay/limits
BioAssay text search advanced page | An interface for reviewing search history and refining search results with Boolean operation | http://www.ncbi.nlm.nih.gov/pcassay/advanced
BioAssay summary | Access and download a bioassay record | http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=myAID
BioAssay data retrieval tool | Retrieve a full data table or an active subset from a single bioassay record | http://pubchem.ncbi.nlm.nih.gov/assay/assaydata.html?aid=myAID; http://pubchem.ncbi.nlm.nih.gov/assay/assaydata.html?act=act&aid=myAID
BioAssay data selection tool | Select a user-defined data subset from a single bioassay record | http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?q=t&aid=myAID
Bioactivity data tool | Retrieve multiple-assay bioactivity data for a single substance sample (SID), chemical structure (CID), protein target (GI) or gene target (GeneID) | http://pubchem.ncbi.nlm.nih.gov/assay.cgi?sid=mySID; http://pubchem.ncbi.nlm.nih.gov/assay.cgi?sid=myCID; http://pubchem.ncbi.nlm.nih.gov/assay.cgi?sid=myGI; http://pubchem.ncbi.nlm.nih.gov/assay.cgi?sid=myGeneID
BioActivity summary (compound-centric) | Summarize and analyze bioactivity data for a set of records, presented from the compound point of view | http://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.cgi?tab=1
BioActivity summary (assay-centric) | Summarize and analyze bioactivity data for a set of records, presented from the assay point of view | http://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.cgi?tab=2
BioActivity summary (target-centric) | Summarize and analyze bioactivity data for a set of records, presented from the target point of view | http://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.cgi?tab=3
Structure-activity relationship analysis (SAR) | Analyze and visualize structure-activity relationship with clustering tools and a heatmap-style display | http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?p=heat
Scatter plot/histogram | Analyze bioassay test results with histogram or scatter plot | http://pubchem.ncbi.nlm.nih.gov/assay/plot.cgi?plottype=2
Dose-response curve tool | Analyze bioassay test results and visualize dose-response curve | http://pubchem.ncbi.nlm.nih.gov/assay/plot.cgi?plottype=1
Related BioAssay | Summarize bioassay relationship by overlap of active compounds, target sequence similarity, deposited annotation, same publication, common pathways and same assay project | http://pubchem.ncbi.nlm.nih.gov/assay/assayHeatmap.cgi
PubChem PUG/SOAP | PubChem programmatic tool for data retrieval | http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html
PUG/REST | PubChem REST API for data retrieval | http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html
Bioassay download tool | A flexible download interface | http://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi
BioAssay FTP | FTP for all PubChem BioAssay records and related information | ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay
BioAssay data standard | XML data specification for PubChem BioAssay data model | ftp://ftp.ncbi.nlm.nih.gov/pubchem/data_spec
PubChem upload | Substance and bioassay submission system | http://pubchem.ncbi.nlm.nih.gov/upload


The content in the PubChem BioAssay database is contributed by >50 organizations worldwide, including US government-funded institutions, pharmaceutical companies, research laboratories and collaborators hosting chemical biology databases. A summary of bioassay vendors and submission counts is provided at http://pubchem.ncbi.nlm.nih.gov/sources/#assay. BioAssay datasets added during the past 2 years include (i) small molecule data from screening centers of the NIH Molecular Libraries and Imaging Program [Molecular Library Program (MLP)] (http://commonfund.nih.gov/molecularlibraries), ICCB-Longwood/NSRB Screen Facility at the Harvard Medical School (http://iccb.med.harvard.edu), EPA Tox21 (http://epa.gov/ncct/Tox21) and Milwaukee Institute for Drug Discovery (http://www4.uwm.edu/drugdiscovery); (ii) curated dataset records from the Meiler Lab at Vanderbilt University, which derive the ultimate bioactivity outcome of a small molecule by combining multiple bioassay results in PubChem to facilitate cheminformatics studies (6); (iii) curated datasets from literature extraction by IUPHAR-DB (7) and ChEMBL (8); and (iv) small interfering RNA (siRNA) data from Drosophila RNAi Screening Center, ICCB-Longwood/NSRB Screening Facility at the Harvard Medical School (http://iccb.med.harvard.edu), Cancer Research UK Cambridge Research Institute, Department of Molecular Cell Biology at Weizmann Institute of Science, Institut National de la Santé et de la Recherche Médicale (INSERM), Peterson Lab at Genentech and ten Dijke Lab at Leiden University Medical Center. Many of these newly added siRNA datasets are associated with recent publications in journals such as Nature Cell Biology (9–11), Genome Research (12), J. Virol. (13), Cancer Research (14), PNAS (15,16), Nature (17–19), Science (20,21) and Nature Genetics (22). Each of these bioassay records is linked to the corresponding abstract in PubMed, allowing PubChem users to track down the publication easily. Vice versa, users of PubMed also gain access to the corresponding bioassay datasets through this cross-link.

PubChem continues to mirror the ChEMBL database (8), hosted at the European Bioinformatics Institute. Multiple ChEMBL releases and database changes over the past 2 years have been incorporated into PubChem. Recently added annotations at ChEMBL are recorded via the Categorized Comment field of the PubChem BioAssay data model (1). Binding, surface, ligand and lipophilic ligand efficiency indices are added to a bioassay record as additional test results. As a result, many of the bioassay records in PubChem have gone through multiple updates. Annotation for bioactivity outcome (e.g. active or inactive) is largely missing in the ChEMBL datasets, hindering their integration with the rest of PubChem data and analysis tools. In such a case, PubChem now assigns bioactivity outcome using a 50 µM cutoff based on readouts such as IC50, EC50 or Ki, allowing a larger portion of the ChEMBL data to be blended into the PubChem system.

Figure 1. Growth in PubChem BioAssay. (A) Records, (B) bioactivity outcomes (counted by AID–SID pair) and (C) unique tested samples.
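The cutoff-based outcome assignment described for ChEMBL-derived records can be illustrated with a minimal sketch. The function name, the 'unspecified' label and the rule that a value at or below the cutoff counts as active are assumptions made for illustration; the text only states that a cutoff of 50 is applied to readouts such as IC50, EC50 or Ki. The cutoff is assumed to be expressed in the same concentration unit as the readout value.

```python
from typing import Optional

# Readout types named in the text as usable for outcome assignment.
POTENCY_READOUTS = {"IC50", "EC50", "Ki"}


def assign_outcome(readout: str, value: Optional[float], cutoff: float = 50.0) -> str:
    """Return 'active', 'inactive' or 'unspecified' for one test result.

    Assumptions (not taken from the source): `value` and `cutoff` share the
    same concentration unit, and a value at or below the cutoff is active.
    """
    if readout not in POTENCY_READOUTS or value is None:
        # No usable potency readout: leave the outcome unassigned.
        return "unspecified"
    return "active" if value <= cutoff else "inactive"
```

Under these assumptions, a record reporting a Ki of 1.2 would be labeled active and one reporting an EC50 of 120 inactive, while a record with an unrelated readout type keeps an unspecified outcome.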


DATABASE INFRASTRUCTURE ENHANCEMENT

A robust and scalable database system is crucial to support the rapid growth of PubChem BioAssay. A set of relational databases and tables is designed and set up on Microsoft SQL servers to (i) accept bioassay submission from depositors; (ii) archive bioassay updates with version control; (iii) track embargo status; (iv) record and derive links and relationships among bioassays and other biomedical information; (v) provide search indexes; (vi) support fast data retrieval and analysis; and (vii) facilitate daily update at the FTP site. Challenged by the accelerated growth of bioassay data content, great efforts have been invested in the past years to enhance the database infrastructure capacity by both hardware upgrade and revised database design. As a result, new services have been added to the PubChem resource. Furthermore, performance in bioassay data retrieval and download services has been significantly improved, largely eliminating the need for a queuing system and minimizing the user wait time.

DATA INTEGRATION AND NEW WEB SERVICES

The PubChem BioAssay database is fully integrated with other biomedical databases hosted by NCBI and provides a suite of web-based and programmatic tools to support data access, retrieval, analysis and download from PubChem or cross-linked databases (Table 1). Several new services for integrating bioassay target and bioactivity data, or grouping bioassays based on an assay project, are described later. Other developments that have focused on behind-the-scene enhancement of data retrieval without significant web interface change will not be summarized in this work.

Rapid access of bioactivity data for a protein or gene target

PubChem BioAssay closes the gap between molecular and chemical biology research by presenting and linking up information of both chemical and RNAi tools in one system, supporting the study of gene function and biological pathways. The majority of small molecule screening data in PubChem are associated with protein targets, while RNAi screening data link each tested reagent to a gene. PubChem provides multiple mechanisms for cross-referencing protein and gene targets from bioactivity data (1). As a result, a protein or gene may link to many bioactivity datasets. It is critical to provide rapid access to such multi-assay bioactivity data for these protein and gene targets. Such a service provides a unique annotation to the corresponding Entrez Protein or Gene record, which leads users to experimental data from chemical biology and RNAi research, enhancing the discoverability of the NCBI Entrez system. Toward this end, two new services, the Protein Target Bioactivity Data Tool and the Gene Target Bioactivity Data Tool, were developed to access the associated bioactivity information in PubChem for protein and gene targets, respectively.

From a protein target record such as G-protein-coupled receptor (GPCR) 35 (http://www.ncbi.nlm.nih.gov/protein/NP_005292.2), bioactivity data for this protein target can be accessed by the link ‘BioAssay by Target (Summary)’. As shown in Figure 2A, this Protein Target Bioactivity Data Tool draws and identifies each tested substance together with its bioactivity results, assay title and a link to detailed data such as dose-response curves. The data table is sorted by bioactivity outcome and potency of the substances by default, showing first active data and potent reagents. Graphical filters are provided at the top of the page, allowing one to drill down to a data subset of one’s interest. For example, this GPCR protein has a ‘Probe’ filter, highlighting three chemical probes discovered by a high-throughput screening (HTS) project for selective GPR35 antagonists.

The bioactivity data for the relevant gene target record (http://www.ncbi.nlm.nih.gov/gene/2859) can be accessed by the link ‘BioAssay by Target (Summary)’. With this Gene Target Bioactivity Data Tool, a similar summary of relevant bioassay activity results is displayed, as shown in Figure 2B. Note that, using a gene identifier in this case, additional data are retrieved, including RNAi test results (as indicated with the filter ‘RNAi’ shown under ‘Substance Types’), which indicate that GPR35 functions as a cellular gene repressing HPV18 LCR, as identified by a genome-wide siRNA screen. This example illustrates the power of aggregating bioactivity data across datasets onto a unified display. The Gene Target Bioactivity Data Tool is particularly useful for accessing datasets from multiple depositors and literature-based data from many journal articles. Moreover, it links simultaneously to findings in chemical biology research and RNAi screenings, enabling users to evaluate the biological role of a gene and to identify its small molecular regulators using data shown on the same display.
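The default ordering described for the target bioactivity tables (active results first, then by potency) can be sketched as a simple sort key. This is only an illustration under stated assumptions: the `Result` record, its field names and the convention that a lower potency value means a more potent substance are hypothetical and are not taken from the PubChem implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    sid: int                   # tested substance accession (hypothetical values)
    outcome: str               # e.g. "active" or "inactive"
    potency: Optional[float]   # assumed: lower value = more potent; None if absent


def default_order(results: List[Result]) -> List[Result]:
    """Sort actives first, then by increasing potency value, missing values last."""
    def key(r: Result):
        return (
            r.outcome != "active",   # False sorts before True, so actives lead
            r.potency is None,       # rows with a potency value precede rows without
            r.potency if r.potency is not None else 0.0,
        )
    return sorted(results, key=key)
```

With this key, an active substance with a reported potency always precedes an active substance without one, and every active row precedes every inactive row.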

BioAssays associated with the same assay project

PubChem tracks the relationships among bioassay records as indicated by submitters. PubChem has also developed several computational methods for identifying additional bioassay linkages based on target sequence similarity, common active compounds and biological pathways, as well as datasets abstracted from the same publication (1). To better support decision making, PubChem now clusters and links up bioassays based on assay projects. This feature aims to use data deposited by a network such as the NIH MLP and the Tox21 program. MLP-funded screening laboratories are required to deposit data progressively into PubChem as an assay project continues. It usually takes months or years to finish an assay project aimed at developing a chemical probe; hence, multiple bioassay datasets are often submitted to PubChem for the same project but under distinct accessions (AIDs). These datasets are highly relevant, often covering a primary HTS result, follow-ups with dose-response and toxicity testing, or counter screenings against biologically related targets, different cell lines or using different assay methods. PubChem allows submitters to specify such relationships via the cross-reference (XRef) data field. On the other hand, it is up to the submitters to provide all links as new data are made available. As a result, cross-references to related bioassay datasets unfortunately may be lacking or incomplete among many datasets, making it difficult for users to discover these key associations.

Figure 2. Bioactivity data for a (A) protein target and (B) gene target.

To improve this situation, it is now a common practice to create a ‘Summary’ bioassay at the outset of a multi-assay project and then link each subsequent related assay back to that summary record. This means that the submitter only needs to specify a single link for each bioassay record to the same summary, and all other links between related assays are automatically generated. As a result, assay projects are indexed on top of the individual records. Users visiting any bioassay record can access all relevant datasets of the same project without the need for the submitter to specify all connections. As shown in Figure 3, the links to these related bioassays are labeled in the BioAssay Summary service as ‘Same Project’ under the ‘Related BioAssays’ section. The Modulation of the Metabotropic Glutamate Receptor mGluR3 (GRM3) assay (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=651839) indicates only one ‘Depositor Specified’ assay, whereas eight bioassay records were identified as related to the same project by the new procedure. One may see details of the related bioassays by clicking the link ‘Same Project’.
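The idea of deriving all ‘Same Project’ links from a single link per record can be sketched as follows. This is a hypothetical illustration, not PubChem's procedure: the function name, the input mapping and the AID values in the usage note are invented for the example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def same_project_links(summary_of: Dict[int, int]) -> Set[Tuple[int, int]]:
    """Derive pairwise links among assays that point to the same summary AID.

    `summary_of` maps each assay AID to the summary AID named by its submitter
    (one link per record, as described in the text).
    """
    projects = defaultdict(set)
    for aid, summary_aid in summary_of.items():
        projects[summary_aid].add(aid)
        projects[summary_aid].add(summary_aid)  # the summary record belongs to its own project

    links: Set[Tuple[int, int]] = set()
    for members in projects.values():
        # Every unordered pair within a project becomes a derived link.
        links.update(combinations(sorted(members), 2))
    return links
```

For example, if hypothetical assays 101 and 102 both point to summary 100, the sketch yields the links (100, 101), (100, 102) and (101, 102), so the relationship between 101 and 102 is found although neither submitter declared it.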

PUBLIC ACCESS

BioAssay record and BioAssay summary service

A PubChem BioAssay record can be accessed via the BioAssay Summary service at http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=myAID, where myAID is a valid BioAssay accession (AID). As shown in Figure 3 for the GRM3 assay (AID 651839), the BioAssay Summary service provides (i) full access to submitted information, including bioassay protocol descriptions, assay data and cross-references; (ii) derived bioassay relationships; and (iii) tools for evaluating tested compounds, studying SAR or researching the target. For the ‘Target’ section, a link ‘More Bioactivity data’ has been recently added to gather all bioactivity data in PubChem associated with the GRM3 target. With the improved database infrastructure, the BioAssay Summary service now provides instant access to the bioassay data table and an enhanced function for data download. With the recently launched PubChem Social Media outreach, links to social media accounts are now provided on this page.

Figure 3. BioAssay Summary page for bioassay record AID 651839. New and enhanced features are highlighted, including fast download, instant access to data table, link to additional bioactivity data targeting GRM3, link to related bioassays on the same project and links to social media account.


BioAssay search

Keyword search in the PubChem BioAssay database is supported by NCBI Entrez at http://www.ncbi.nlm.nih.gov/pcassay. Textual information in PubChem BioAssay is indexed under numerous fields. An advanced interface is provided at http://www.ncbi.nlm.nih.gov/pcassay/limits (Limits page) to access multiple indices and filters (1). Based on information provided in categorized comment fields and keywords in the title of a bioassay record, new filters were added to support the identification of records containing (i) biochemical assay, (ii) cell-based assay, (iii) protein–protein interaction bioactivity and (iv) in vivo or in vitro assay. A newly added menu ‘Assay Project’ can be used to select an assay project and access related datasets. ChEMBL depositor information is also indexed to support sub-setting ChEMBL records. As a result, although http://www.ncbi.nlm.nih.gov/pcassay?term=ChEMBL[sourcename] retrieves all ChEMBL bioassays in PubChem, http://www.ncbi.nlm.nih.gov/pcassay?term=%22ChEMBL%3A%3AScientific+Literature%22%5BSourceName%5D[SourceName] retrieves literature-based records from ChEMBL, and http://www.ncbi.nlm.nih.gov/pcassay?term=%22ChEMBL%3A%3ASt+Jude+Malaria+Screening%22%5BSourceName%5D[SourceName] retrieves ChEMBL records deposited by St Jude Malaria Screening.
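The source-name queries above follow a regular pattern: a quoted source name, the [SourceName] field tag, and URL encoding of the result. A minimal sketch of building such search URLs is given here; the helper names are hypothetical, and the base address is simply the Entrez BioAssay search page cited in the text.

```python
from urllib.parse import urlencode

# Entrez BioAssay search page, as cited in the text.
PCASSAY_SEARCH = "http://www.ncbi.nlm.nih.gov/pcassay"


def source_term(source_name: str) -> str:
    """Build an Entrez term restricting a search to one depositor source name."""
    return f'"{source_name}"[SourceName]'


def pcassay_search_url(term: str) -> str:
    """Return a search URL with the term percent-encoded as a query parameter."""
    return PCASSAY_SEARCH + "?" + urlencode({"term": term})
```

For instance, `pcassay_search_url(source_term("ChEMBL::Scientific Literature"))` produces a URL whose query string encodes the quotes, colons and brackets in the same way as the literature-based ChEMBL example in the text.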

PubChem BioAssay FTP AND DOWNLOAD

PubChem provides multiple services for users to download bioassay records, which have been described previously (1). These primarily include (i) an enhanced download function at the Summary service (shown in Figure 3); (ii) a web-based BioAssay download service at http://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi, with a flexible interface supporting full or partial data download by specifying bioassay accessions (AIDs) and tested substance accessions (SIDs); and (iii) the daily updated PubChem BioAssay FTP at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay, providing open access to all bioassay datasets. While the primary FTP structure remains the same, one new FTP directory, ‘Extras’, is added to offer additional information on the BioAssay resource. In this folder, the file ‘Cid2BioactivityLink’ provides a list of tested compounds and the corresponding URLs linking to associated bioactivity data. Similarly, the ‘Gi2BioactivityLink’ and ‘Geneid2BioactivityLink’ files provide the list of the corresponding bioactivity data links for protein and gene targets, respectively. The ‘Aid2GiGeneid’ file contains all the bioassay (AID), protein target (GI) and gene target (Gene ID) associations in the BioAssay database. Also, a file for assay project-based related bioassays is added to the directory at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/AssayNeighbors. Column headers for the comma-separated values (CSV) format have been modified to provide consistency among multiple download methods (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/CSV/README). Readout names are now provided in CSV files to ease data parsing and interpretation. In addition, PubChem PUG/SOAP (http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html) and PUG/REST (http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html) facilities are being developed to support programmatic retrieval of bioassay information.
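As an illustration of how an association file such as ‘Aid2GiGeneid’ might be used after download, the sketch below indexes bioassay accessions by gene target. The file layout is an assumption: the text does not describe the format, so a tab-delimited table with a header row and columns named AID, GI and GeneID is hypothetical, as are the function name and the sample rows in the usage note.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Set


def index_aids_by_gene(text: str) -> Dict[int, Set[int]]:
    """Map each Gene ID to the set of AIDs associated with it.

    Assumed layout (hypothetical): tab-delimited text with a header row
    containing the columns AID, GI and GeneID; empty cells mean no association.
    """
    index: Dict[int, Set[int]] = defaultdict(set)
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        gene = (row.get("GeneID") or "").strip()
        if gene:  # skip rows that carry only a protein (GI) association
            index[int(gene)].add(int(row["AID"]))
    return dict(index)
```

Given hypothetical rows linking AIDs 1001 and 1002 to Gene ID 2859, the function returns a dictionary with that gene as key and both AIDs as its value, which can then be combined with the bioactivity link files for batch lookups.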


The content in the PubChem BioAssay database is contributed by >50 organizations worldwide, including US government-funded institutions, pharmaceutical companies, research laboratories and collaborators hosting chemical biology databases. A summary of bioassay vendors and submission counts is provided at http://pubchem.ncbi.nlm.nih.gov/sources/#assay. BioAssay datasets added during the past 2 years include: (i) small molecule data from screening centers of the NIH Molecular Libraries and Imaging Program [Molecular Library Program (MLP)] (http://commonfund.nih.gov/molecularlibraries), ICCB-Longwood/NSRB Screening Facility at the Harvard Medical School (http://iccb.med.harvard.edu), EPA Tox21 (http://epa.gov/ncct/Tox21) and Milwaukee Institute for Drug Discovery (http://www4.uwm.edu/drugdiscovery); (ii) curated dataset records from the Meiler Lab at Vanderbilt University, which derive the ultimate bioactivity outcome of a small molecule by combining multiple bioassay results in PubChem to facilitate cheminformatics studies (6); (iii) curated datasets from literature extraction by IUPHAR-DB (7) and ChEMBL (8); and (iv) small interfering RNA (siRNA) data from Drosophila RNAi Screening Center, ICCB-Longwood/NSRB Screening Facility at the Harvard Medical School (http://iccb.med.harvard.edu), Cancer Research UK Cambridge Research Institute, Department of Molecular Cell Biology at Weizmann Institute of Science, Institut National de la Santé et de la Recherche Médicale (INSERM), Peterson Lab at Genentech and ten Dijke Lab at Leiden University Medical Center. Many of these newly added siRNA datasets are associated with recent publications in journals such as Nature Cell Biology (9–11), Genome Research (12), J. Virol. (13), Cancer Research (14), PNAS (15,16), Nature (17–19), Science (20,21) and Nature Genetics (22). Each of these bioassay records is linked to the corresponding abstract in PubMed, allowing PubChem users to track down the publication easily. Conversely, users of PubMed gain access to the corresponding bioassay datasets through this cross-link.

PubChem continues to mirror the ChEMBL database (8) hosted at the European Bioinformatics Institute. Multiple ChEMBL releases and database changes over the past 2 years have been incorporated into PubChem. Recently added annotations at ChEMBL are recorded via the Categorized Comment field of the PubChem BioAssay data model (1). Binding, surface, ligand and lipophilic ligand efficiency indices are added to a bioassay record as additional test results. As a result, many of the bioassay records in PubChem have gone through multiple updates. Annotation for bioactivity outcome (e.g. active or inactive) is largely missing in the ChEMBL datasets, hindering their integration with the rest of PubChem data and analysis tools. In such cases, PubChem now assigns the bioactivity outcome using a 50 µM cutoff based on readouts such as IC50, EC50 or Ki, allowing a larger portion of the ChEMBL data to be blended into the PubChem system.

Figure 1. Growth in PubChem BioAssay: (A) records, (B) bioactivity outcomes (counted by AID–SID pair) and (C) unique tested samples.
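The cutoff-based outcome assignment for ChEMBL records can be illustrated with a short Python sketch. It is not PubChem's implementation: the function name, the unit table, the label used for missing values and the choice to treat a readout exactly at the cutoff as active are all assumptions made for illustration. Only the 50 µM threshold and the readout types (IC50, EC50, Ki) come from the text.

```python
from typing import Optional

# Cutoff stated in the text, expressed in micromolar.
CUTOFF_UM = 50.0

# Conversion factors to micromolar (assumed set of units, for illustration).
_TO_UM = {"nM": 1e-3, "uM": 1.0, "mM": 1e3, "M": 1e6}


def assign_outcome(value: Optional[float], unit: str = "uM") -> str:
    """Label a potency readout (e.g. IC50, EC50 or Ki) as active or inactive.

    Assumption: a value at or below the cutoff counts as active; the text
    does not say how the boundary or missing readouts are handled.
    """
    if value is None:
        return "unspecified"  # hypothetical label for records with no readout
    value_um = value * _TO_UM[unit]
    return "active" if value_um <= CUTOFF_UM else "inactive"


if __name__ == "__main__":
    for readout in [(1.2, "uM"), (75.0, "uM"), (500.0, "nM")]:
        print(readout, assign_outcome(*readout))
```

Normalizing to a single unit before comparing matters in practice, because literature-derived readouts are often reported in mixed units.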

DATABASE INFRASTRUCTURE ENHANCEMENT

A robust and scalable database system is crucial to support the rapid growth of PubChem BioAssay. A set of relational databases and tables is designed and set up on Microsoft SQL servers to: (i) accept bioassay submissions from depositors; (ii) archive bioassay updates with version control; (iii) track embargo status; (iv) record and derive links and relationships among bioassays and other biomedical information; (v) provide search indexes; (vi) support fast data retrieval and analysis; and (vii) facilitate daily updates at the FTP site. Challenged by the accelerated growth of bioassay data content, great efforts have been invested in the past years to enhance the database infrastructure capacity through both hardware upgrades and revised database design. As a result, new services have been added to the PubChem resource. Furthermore, the performance of bioassay data retrieval and download services has been significantly improved, eliminating the need for a queuing system and minimizing user wait time.

DATA INTEGRATION AND NEW WEB SERVICES

The PubChem BioAssay database is fully integrated with other biomedical databases hosted by NCBI and provides a suite of web-based and programmatic tools to support data access, retrieval, analysis and download from PubChem or cross-linked databases (Table 1). Several new services for integrating bioassay target and bioactivity data, or for grouping bioassays based on an assay project, are described below. Other developments that focused on behind-the-scenes enhancement of data retrieval without significant web interface change are not summarized in this work.

Rapid access to bioactivity data for a protein or gene target

PubChem BioAssay closes the gap between molecular and chemical biology research by presenting and linking information on both chemical and RNAi tools in one system, supporting the study of gene function and biological pathways. The majority of small molecule screening data in PubChem are associated with protein targets, while RNAi screening data link each tested reagent to a gene. PubChem provides multiple mechanisms for cross-referencing protein and gene targets from bioactivity data (1). As a result, a protein or gene may link to many bioactivity datasets, and it is critical to provide rapid access to such multi-assay bioactivity data for these protein and gene targets. Such a service also provides a unique annotation for the corresponding Entrez Protein or Gene record, leading users to experimental data from chemical biology and RNAi research and enhancing the discoverability of the NCBI Entrez system. Toward this end, two new services, the Protein Target Bioactivity Data Tool and the Gene Target Bioactivity Data Tool, were developed to access bioactivity information in PubChem associated with protein and gene targets, respectively.

From a protein target record, such as G-protein-coupled receptor (GPCR) 35 (http://www.ncbi.nlm.nih.gov/protein/NP_005292.2), bioactivity data for this protein target can be accessed by the link ‘BioAssay by Target (Summary)’. As shown in Figure 2A, this Protein Target Bioactivity Data Tool draws and identifies each tested substance, together with its bioactivity results, assay title and a link to detailed data such as dose-response curves. By default, the data table is sorted by bioactivity outcome and potency of the substances, showing active data and potent reagents first. Graphical filters are provided at the top of the page, allowing one to drill down to a data subset of interest. For example, this GPCR protein has a ‘Probe’ filter highlighting three chemical probes discovered by a high-throughput screening (HTS) project for selective GPR35 antagonists.

The bioactivity data for the relevant gene target record (http://www.ncbi.nlm.nih.gov/gene/2859) can be accessed by the link ‘BioAssay by Target (Summary)’. With this Gene Target Bioactivity Data Tool, a similar summary of relevant bioassay activity results is displayed, as shown in Figure 2B. Note that, by using a gene identifier in this case, additional data are retrieved, including RNAi test results (as indicated by the filter ‘RNAi’ shown under ‘Substance Types’), which indicate that GPR35 functions as a cellular gene repressing the HPV18 LCR, as identified by a genome-wide siRNA screen. This example illustrates the power of aggregating bioactivity data across datasets onto a unified display. The Gene Target Bioactivity Data Tool is particularly useful for accessing datasets from multiple depositors and literature-based data from many journal articles. Moreover, it links simultaneously to findings in chemical biology research and RNAi screenings, enabling users to evaluate the biological role of a gene and to identify its small-molecule regulators using data shown on the same display.
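The default ordering used by the target bioactivity tools (outcome first, then potency) can be sketched in Python as follows. The record fields, the outcome ranking beyond "active first" and the handling of missing potency values are assumptions for illustration, not a description of the actual service; the identifiers in the example are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    sid: int                     # tested substance accession (hypothetical values below)
    aid: int                     # bioassay accession
    outcome: str                 # e.g. "active", "inconclusive", "inactive"
    potency_um: Optional[float]  # potency in micromolar, if reported


# Assumed ranking: active results first; unknown labels sort last.
_OUTCOME_RANK = {"active": 0, "inconclusive": 1, "inactive": 2}


def default_order(results: List[Result]) -> List[Result]:
    """Sort by outcome rank, then by ascending potency (most potent first)."""
    def key(r: Result):
        rank = _OUTCOME_RANK.get(r.outcome, len(_OUTCOME_RANK))
        potency = r.potency_um if r.potency_um is not None else float("inf")
        return (rank, potency)
    return sorted(results, key=key)


if __name__ == "__main__":
    rows = [
        Result(sid=11, aid=1, outcome="inactive", potency_um=None),
        Result(sid=12, aid=1, outcome="active", potency_um=3.5),
        Result(sid=13, aid=2, outcome="active", potency_um=0.02),
        Result(sid=14, aid=2, outcome="active", potency_um=None),
    ]
    print([r.sid for r in default_order(rows)])
```

Using a composite sort key keeps the two criteria explicit, and placing missing potencies at infinity keeps active results without a reported potency after those with one.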

BioAssays associated with the same assay project

PubChem tracks the relationships among bioassay records as indicated by submitters. PubChem has also developed several computational methods for identifying additional bioassay linkages based on target sequence similarity, common active compounds and biological pathways, as well as datasets abstracted from the same publication (1). To better support decision making, PubChem now clusters and links bioassays based on assay projects. This feature aims to make use of data deposited by a network such as the NIH MLP and the Tox21 program. MLP-funded screening laboratories are required to deposit data progressively into PubChem as an assay project continues. It usually takes months or years to finish an assay project aimed at developing a chemical probe; hence, multiple bioassay datasets are often submitted to PubChem for the same project but under distinct accessions (AIDs). These datasets are highly relevant to one another, often covering a primary HTS result, follow-ups with dose-response and toxicity testing, or counter screenings against biologically related targets, in different cell lines or using different assay methods. PubChem allows submitters to specify such relationships via the cross-reference (XRef) data field. However, it is up to the submitters to provide all links as new data are made available. As a result, cross-references to related bioassay datasets may be lacking or incomplete among many datasets, making it difficult for users to discover these key associations.

Figure 2. Bioactivity data for a (A) protein target and (B) gene target.

To improve this situation, it is now a common practice to create a ‘Summary’ bioassay at the outset of a multi-assay project and then link each subsequent related assay back to that summary record. This means that the submitter only needs to specify a single link from each bioassay record to the same summary, and all other links between related assays are generated automatically. As a result, assay projects are indexed on top of the individual records. Users visiting any bioassay record can access all relevant datasets of the same project without the need for the submitter to specify all connections. As shown in Figure 3, the links to these related bioassays are labeled in the BioAssay Summary service as ‘Same Project’ under the ‘Related BioAssays’ section. The Modulation of the Metabotropic Glutamate Receptor mGluR3 (GRM3) assay (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=651839) indicates only one ‘Depositor Specified’ assay, whereas eight bioassay records were identified as related to the same project by the new procedure. One may see details of the related bioassays by clicking the link ‘Same Project’.
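The idea of deriving all ‘Same Project’ links from a single link per record can be sketched in Python. This is a minimal illustration under assumptions: the input is taken to be a mapping from each AID to the summary AID it cites, the function name is hypothetical, and the AIDs in the example are invented. The actual PubChem procedure is not described at this level of detail.

```python
from collections import defaultdict
from typing import Dict, List


def same_project_links(summary_of: Dict[int, int]) -> Dict[int, List[int]]:
    """Expand one 'links to summary' entry per assay into full project links.

    summary_of maps a bioassay AID to the AID of the summary record it
    cross-references. Every assay sharing a summary, plus the summary
    itself, is treated as one project.
    """
    members = defaultdict(set)
    for aid, summary in summary_of.items():
        members[summary].add(summary)
        members[summary].add(aid)

    links: Dict[int, List[int]] = {}
    for aids in members.values():
        for aid in aids:
            # Each record links to every other record in the same project.
            links[aid] = sorted(aids - {aid})
    return links


if __name__ == "__main__":
    # Invented AIDs: three assays cite summary 1001, one cites summary 2000.
    xrefs = {1002: 1001, 1003: 1001, 1004: 1001, 2001: 2000}
    for aid, related in sorted(same_project_links(xrefs).items()):
        print(aid, related)
```

With n assays in a project, the submitter supplies n links rather than the n(n-1)/2 pairwise cross-references that full manual linking would require.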

PUBLIC ACCESS

BioAssay record and BioAssay summary service

A PubChem BioAssay record can be accessed via the BioAssay Summary service at http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=myAID, where myAID is a valid BioAssay accession (AID). As shown in Figure 3 for the GRM3 assay (AID 651839), the BioAssay Summary service provides: (i) full access to submitted information, including bioassay protocol descriptions, assay data and cross-references; (ii) derived bioassay relationships; and (iii) tools for evaluating tested compounds, studying SAR or researching the target. In the ‘Target’ section, a link ‘More Bioactivity data’ has recently been added to gather all bioactivity data in PubChem associated with the GRM3 target. Owing to the improved database infrastructure, the BioAssay Summary service now provides instant access to the bioassay data table and enhanced data download functions. With the recently launched PubChem Social Media outreach, links to social media accounts are now provided on this page.

Figure 3. BioAssay Summary page for bioassay record AID 651839. New and enhanced features are highlighted, including fast download, instant access to the data table, a link to additional bioactivity data targeting GRM3, a link to related bioassays on the same project and links to social media accounts.

BioAssay search

Keyword search in the PubChem BioAssay database is supported by NCBI Entrez at http://www.ncbi.nlm.nih.gov/pcassay. Textual information in PubChem BioAssay is indexed under numerous fields. An advanced interface is provided at http://www.ncbi.nlm.nih.gov/pcassay/limits (Limits page) to access multiple indices and filters (1). Based on information provided in categorized comment fields and keywords in the title of a bioassay record, new filters were added to support the identification of records containing: (i) biochemical assay; (ii) cell-based assay; (iii) protein–protein interaction bioactivity; and (iv) in vivo or in vitro assay. A newly added menu, ‘Assay Project’, can be used to select an assay project and access related datasets. ChEMBL depositor information is also indexed to support sub-setting ChEMBL records. As a result, whereas http://www.ncbi.nlm.nih.gov/pcassay?term=ChEMBL[sourcename] retrieves all ChEMBL bioassays in PubChem, http://www.ncbi.nlm.nih.gov/pcassay?term=%22ChEMBL%3A%3AScientific+Literature%22%5BSourceName%5D[SourceName] retrieves literature-based records from ChEMBL, and http://www.ncbi.nlm.nih.gov/pcassay?term=%22ChEMBL%3A%3ASt+Jude+Malaria+Screening%22%5BSourceName%5D[SourceName] retrieves ChEMBL records deposited by St Jude Malaria Screening.
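The Entrez URLs quoted above carry a percent-encoded `term` parameter. As a minimal sketch, the following Python function builds such a URL from a plain query string using the standard library; the function name is hypothetical, and the base URL is the pcassay address given in the text. No request is sent.

```python
from urllib.parse import urlencode

# Entrez entry point for PubChem BioAssay, as given in the text.
PCASSAY_URL = "http://www.ncbi.nlm.nih.gov/pcassay"


def pcassay_search_url(term: str) -> str:
    """Return a pcassay search URL with the Entrez query percent-encoded.

    urlencode applies quote_plus, so spaces become '+', and quotes, colons
    and square brackets are escaped as %22, %3A, %5B and %5D.
    """
    return f"{PCASSAY_URL}?{urlencode({'term': term})}"


if __name__ == "__main__":
    # A phrase restricted to the SourceName index.
    print(pcassay_search_url('"ChEMBL::Scientific Literature"[SourceName]'))
    # All ChEMBL records.
    print(pcassay_search_url("ChEMBL[sourcename]"))
```

Encoding the query programmatically avoids hand-escaping the double quotes and `::` separator that the ChEMBL source names contain.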

PubChem BioAssay FTP AND DOWNLOAD

PubChem provides multiple services for users to download bioassay records, which have been described previously (1). These primarily include: (i) an enhanced download function at the Summary service (shown in Figure 3); (ii) a web-based BioAssay download service at http://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi, with a flexible interface supporting full or partial data download by specifying bioassay accessions (AIDs) and tested substance accessions (SIDs); and (iii) the daily updated PubChem BioAssay FTP at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay, providing open access to all bioassay datasets. While the primary FTP structure remains the same, one new FTP directory, ‘Extras’, has been added to offer additional information on the BioAssay resource. In this folder, the file ‘Cid2BioactivityLink’ provides a list of tested compounds and the corresponding URLs linking to associated bioactivity data. Similarly, the ‘Gi2BioactivityLink’ and ‘Geneid2BioactivityLink’ files provide lists of the corresponding bioactivity data links for protein and gene targets, respectively. The ‘Aid2GiGeneid’ file contains all the bioassay (AID), protein target (GI) and gene target (Gene ID) associations in the BioAssay database. In addition, a file for assay project-based related bioassays has been added to the directory at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/AssayNeighbors. Column headers for the comma-separated values (CSV) format have been modified to provide consistency among multiple download methods (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/CSV/README). Readout names are now provided in CSV files to ease data parsing and interpretation. In addition, PubChem PUG SOAP (http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html) and PUG REST (http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html) facilities are being developed to support programmatic retrieval of bioassay information.
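As a hedged illustration of programmatic retrieval, the Python sketch below composes a PUG REST request URL for one or more bioassay records. The URL layout (`/rest/pug/assay/aid/<AIDs>/<operation>/<output>`) follows the publicly documented PUG REST syntax rather than anything stated in this article, which describes the facility as still under development, so it should be checked against the current PUG REST documentation. The function name and defaults are hypothetical, and no request is sent.

```python
from typing import Iterable

# Assumed base URL, taken from the public PUG REST documentation.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_url(aids: Iterable[int],
              operation: str = "description",
              output: str = "JSON") -> str:
    """Build a PUG REST URL for the given bioassay accessions (AIDs).

    Only the URL is constructed; fetching it (e.g. with urllib.request)
    is left to the caller.
    """
    aid_list = [int(a) for a in aids]
    if not aid_list or any(a <= 0 for a in aid_list):
        raise ValueError("AIDs must be positive integers")
    joined = ",".join(str(a) for a in aid_list)
    return f"{PUG_REST_BASE}/assay/aid/{joined}/{operation}/{output}"


if __name__ == "__main__":
    # AID 651839 is the GRM3 assay used as the running example in the text.
    print(assay_url([651839]))
```

Keeping URL construction separate from the network call makes the request easy to log and to test without contacting the server.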

PubChem UPLOAD FOR BioAssay SUBMISSION

As a public repository handling diverse and vast amounts of chemical structure and bioassay data, it is critical for PubChem to provide an efficient and user-friendly way to upload data. The recently released PubChem Upload (http://pubchem.ncbi.nlm.nih.gov/upload) makes use of advances in web technologies to offer streamlined support for data submissions and updates to the Substance and BioAssay databases. PubChem Upload supports all functionalities and data exchange formats of its predecessor (1). Furthermore, it provides an extensive set of wizards, inline help tips and tutorials for guiding submitters to enter assay data and descriptive information. More specifically, the new assay submission capabilities offered by PubChem Upload include: (i) bioassay submission wizards to assist novice users for both small molecule and RNAi screenings; (ii) improved user interface response to complex input with newer web technology; (iii) simplified new user registration and upgrades for production user accounts; (iv) improved help, including hints built into the user interface and a tutorial; (v) extensive PubChem bioassay templates for new submissions or for record updates; (vi) full editing and integration of assay data and description tables; and (vii) expanded import/export handling of spreadsheets for assays. A detailed help document, tutorial and sample submission templates for PubChem Upload are available at http://pubchem.ncbi.nlm.nih.gov/upload/docs/upload_help.html, http://pubchem.ncbi.nlm.nih.gov/upload/tutorial and http://pubchem.ncbi.nlm.nih.gov/upload/docs/upload_help.html#AssaySubmission, respectively. A detailed description of PubChem Upload will be provided in a separate article.

SUMMARY

PubChem is committed to serving as a public repository for bioactivity data of small molecules and RNAi. PubChem also provides an integrated information platform with a suite of tools allowing users to query, analyze and download all database content. PubChem will continue to improve its services and tools as technology advances, and to further integrate the information it contains with third-party annotations and other public biomedical data. With open access to the data and the delivery of the new Upload system, PubChem welcomes the community to use the resource and to contribute data content to the repository.

ACKNOWLEDGEMENTS

The authors thank all submitters who have contributed data to PubChem and the rest of the PubChem team for their support.

FUNDING

The NIH Intramural Research Program. Funding for open access charge: National Institutes of Health, USA.

Conflict of interest statement. None declared.

REFERENCES

1. Wang,Y., Xiao,J., Suzek,T.O., Zhang,J., Wang,J., Zhou,Z., Han,L., Karapetyan,K., Dracheva,S., Shoemaker,B.A. et al. (2012) PubChem’s BioAssay database. Nucleic Acids Res., 40, D400–D412.

2. Wang,Y., Bolton,E., Dracheva,S., Karapetyan,K., Shoemaker,B.A., Suzek,T.O., Wang,J., Xiao,J., Zhang,J. and Bryant,S.H. (2010) An overview of the PubChem BioAssay resource. Nucleic Acids Res., 38, D255–D266.

3. Wang,Y., Xiao,J., Suzek,T.O., Zhang,J., Wang,J. and Bryant,S.H. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res., 37, W623–W633.

4. Bolton,E.E., Wang,Y., Thiessen,P.A. and Bryant,S.H. (2008) PubChem: integrated platform of small molecules and biological activities. Annu. Rep. Comput. Chem., 4, 217–241.

5. Sayers,E.W., Barrett,T., Benson,D.A., Bolton,E., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., DiCuccio,M., Federhen,S. et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 39, D38–D51.

6. Butkiewicz,M., Lowe,E.W. Jr, Mueller,R., Mendenhall,J.L., Teixeira,P.L., Weaver,C.D. and Meiler,J. (2013) Benchmarking ligand-based virtual high-throughput screening with the PubChem database. Molecules, 18, 735–756.

7. Sharman,J.L., Benson,H.E., Pawson,A.J., Lukito,V., Mpamhanga,C.P., Bombail,V., Davenport,A.P., Peters,J.A., Spedding,M. and Harmar,A.J. (2013) IUPHAR-DB: updated database content and new features. Nucleic Acids Res., 41, D1083–D1088.

8. Gaulton,A., Bellis,L.J., Bento,A.P., Chambers,J., Davies,M., Hersey,A., Light,Y., McGlinchey,S., Michalovich,D., Al-Lazikani,B. et al. (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res., 40, D1100–D1107.

9. Mulder,K.W., Wang,X., Escriu,C., Ito,Y., Schwarz,R.F., Gillis,J., Sirokmany,G., Donati,G., Uribe-Lewis,S., Pavlidis,P. et al. (2012) Diverse epigenetic strategies interact to control epidermal differentiation. Nat. Cell Biol., 14, 753–763.

10. Chih,B., Liu,P., Chinn,Y., Chalouni,C., Komuves,L.G., Hass,P.E., Sandoval,W. and Peterson,A.S. (2012) A ciliopathy complex at the transition zone protects the cilia as a privileged membrane domain. Nat. Cell Biol., 14, 61–72.

11. Prager-Khoutorsky,M., Lichtenstein,A., Krishnan,R., Rajendran,K., Mayo,A., Kam,Z., Geiger,B. and Bershadsky,A.D. (2011) Fibroblast polarization is a matrix-rigidity-dependent process controlled by focal adhesion mechanosensing. Nat. Cell Biol., 13, 1457–1465.

12. Imberg-Kazdan,K., Ha,S., Greenfield,A., Poultney,C.S., Bonneau,R., Logan,S.K. and Garabedian,M.J. (2013) A genome-wide RNA interference screen identifies new regulators of androgen receptor function in prostate cancer cells. Genome Res., 23, 581–591.

13. Powell,M.L., Smith,J.A., Sowa,M.E., Harper,J.W., Iftner,T., Stubenrauch,F. and Howley,P.M. (2010) NCoR1 mediates papillomavirus E8^E2C transcriptional repression. J. Virol., 84, 4451–4460.

14. Galluzzi,L., Morselli,E., Vitale,I., Kepp,O., Senovilla,L., Criollo,A., Servant,N., Paccard,C., Hupe,P., Robert,T. et al. (2010) miR-181a and miR-630 regulate cisplatin-induced cancer cell death. Cancer Res., 70, 1793–1803.

15. Smith,J.A., White,E.A., Sowa,M.E., Powell,M.L., Ottinger,M., Harper,J.W. and Howley,P.M. (2010) Genome-wide siRNA screen identifies SMCX, EP400, and Brd4 as E2-dependent regulators of human papillomavirus oncogene expression. Proc. Natl Acad. Sci. USA, 107, 3752–3757.

16. Zhang,S.L., Yeromin,A.V., Zhang,X.H., Yu,Y., Safrina,O., Penna,A., Roos,J., Stauderman,K.A. and Cahalan,M.D. (2006) Genome-wide RNAi screen of Ca(2+) influx identifies genes that regulate Ca(2+) release-activated Ca(2+) channel activity. Proc. Natl Acad. Sci. USA, 103, 9357–9362.

17. Friedman,A. and Perrimon,N. (2006) A functional RNAi screen for regulators of receptor tyrosine kinase and ERK signalling. Nature, 444, 230–234.

18. Gwack,Y., Sharma,S., Nardone,J., Tanasa,B., Iuga,A., Srikanth,S., Okamura,H., Bolton,D., Feske,S., Hogan,P.G. et al. (2006) A genome-wide Drosophila RNAi screen identifies DYRK-family kinases as regulators of NFAT. Nature, 441, 646–650.

19. Bard,F., Casano,L., Mallabiabarrena,A., Wallace,E., Saito,K., Kitayama,H., Guizzunti,G., Hu,Y., Wendler,F., Dasgupta,R. et al. (2006) Functional genomics reveals genes involved in protein secretion and Golgi organization. Nature, 439, 604–607.

20. Vig,M., Peinelt,C., Beck,A., Koomoa,D.L., Rabah,D., Koblan-Huberson,M., Kraft,S., Turner,H., Fleig,A., Penner,R. et al. (2006) CRACM1 is a plasma membrane protein essential for store-operated Ca2+ entry. Science, 312, 1220–1223.

21. DasGupta,R., Kaykas,A., Moon,R.T. and Perrimon,N. (2005) Functional genomic analysis of the Wnt-wingless signaling pathway. Science, 308, 826–833.

22. Nybakken,K., Vokes,S.A., Lin,T.Y., McMahon,A.P. and Perrimon,N. (2005) A genome-wide RNA interference screen in Drosophila melanogaster cells for new components of the Hh signaling pathway. Nat. Genet., 37, 1323–1332.

8 Nucleic Acids Research 2013

at National Institutes of H

ealth Library on D

ecember 12 2013

httpnaroxfordjournalsorgD

ownloaded from

DATABASE INFRASTRUCTURE ENHANCEMENT

A robust and scalable database system is crucial tosupport the rapid growth of PubChem BioAssay A setof relational databases and tables is designed and set upon Microsoft SQL servers to (i) accept bioassay submis-sion from depositors (ii) archive bioassay update withversion control (iii) track embargo status (iv) recordand derive links and relationships among bioassays andother biomedical information (v) provide search indexes(vi) support fast data retrieval and analysis and (vii) facili-tate daily update at the FTP site Challenged by theaccelerated growth of bioassay data content greatefforts have been invested in the past years to enhancethe database infrastructure capacity by both hardwareupgrade and revised database design As a result newservices have been added to the PubChem resourceFurthermore performance in bioassay data retrieval anddownload services have been significantly improvedthereby significantly eliminating a queuing system tominimize the user wait time

DATA INTEGRATION AND NEW WEB SERVICES

The PubChem BioAssay database is fully integrated withother biomedical databases hosted by NCBI and providesa suite of web-based and programmatic tools to supportdata access retrieval analysis and download fromPubChem or cross-linked databases (Table 1) Severalnew services for integrating bioassay target and bioactivitydata or grouping bioassays based on an assay project aredescribed later Other developments that have focused onbehind-the-scene enhancement of data retrieval withoutsignificant web interface change will not be summarizedin this work

Rapid access of bioactivity data for a protein orgene target

PubChem BioAssay closes the gap between molecular andchemical biology research by presenting and linking up in-formation of both chemical and RNAi tools in one systemsupporting the study of gene function and biologicalpathways The majority of small molecule screening datain PubChem are associated with protein targets whileRNAi screening data links each tested reagent to a genePubChem provides multiple mechanisms for cross-referencing protein and gene targets from bioactivitydata (1) As a result a protein or gene may link to manybioactivity datasets It is critical to provide rapid access tosuch multi-assay bioactivity data for these protein and genetargets Such a service provides a unique annotation serviceto the corresponding Entrez Protein or Gene record whichleads users to experimental data from chemical biology andRNAi research enhancing the discoverability of the NCBIEntrez system Toward this end two new services theProtein Target Bioactivity Data Tool and the Gene TargetBioactivity Data Tool were developed respectively toaccess associated bioactivity information in PubChemFrom a protein target record such as G-protein-

coupled receptor (GPCR) 35 (httpwwwncbinlmnihgovproteinNP_0052922) bioactivity data for this

protein target can be accessed by the link lsquoBioAssay byTarget (Summary)rsquo As shown in Figure 2A this ProteinTarget Bioactivity Data Tool draws and identifies eachtested substance together with its bioactivity resultsassay title and a link to detailed data such as dose-response curves The data table is sorted by bioactivityoutcome and potency of the substances by defaultshowing first active data and potent reagents Graphicalfilters are provided at the top of the page allowing one todrill down to a data subset of onersquos interest For examplethis GPCR protein has a lsquoProbersquo filter highlighting threechemical probes discovered by a high-throughputscreening (HTS) project for selective GPR35 antagonists

The bioactivity data for the relevant gene target record(httpwwwncbinlmnihgovgene2859) can be accessedby the link lsquoBioAssay by Target (Summary)rsquo With thisGene Target Bioactivity Data Tool a similar summaryof relevant bioassay activity results is displayed asshown in Figure 2B Note that using a gene identifier inthis case additional data are retrieved including RNAitest results (as indicated with the filter lsquoRNAirsquo shownunder lsquoSubstance Typesrsquo) which indicates that GPR35functions as a cellular gene repressing HPV18 LCR asidentified by a genome-wide siRNA screen This exampleillustrates the power of aggregating bioactivity data acrossdatasets onto a unified display The Gene TargetBioactivity Data Tool is particularly useful for accessingdatasets from multiple depositors and literature-baseddata from many journal articles Moreover it links simul-taneously to findings in chemical biology research andRNAi screenings enabling users to evaluate the biologicalrole of a gene and to identify its small molecular regula-tors using data shown on the same display

BioAssays associated with the same assay project

PubChem tracks the relationships among bioassay recordsas indicated by submitters PubChem has also developedseveral computational methods for identifying additionalbioassay linkages based on target sequence similaritycommon active compounds and biological pathways aswell as datasets abstracted from the same publication(1) To better support decision making PubChem nowclusters and links up bioassays based on assay projectsThis feature aims to use data deposited by a network suchas the NIH MLP and the Tox21 program MLP-fundedscreening laboratories are required to deposit data pro-gressively into PubChem as an assay project continuesIt usually takes months or years to finish an assayproject aimed at developing chemical probe hence oftenmultiple bioassay datasets are submitted to PubChem forthe same project but under distinct accessions (AIDs)These datasets are highly relevant often covering aprimary HTS result follow-ups with dose-response andtoxicity testing or counter screenings against biologicallyrelated targets different cell lines or using different assaymethods PubChem allows submitters to specify such re-lationships via the cross-reference (XRef) data field Onthe other hand it is up to the submitters to provide alllinks as new data are made available As a result cross-references to related bioassay datasets unfortunately may

4 Nucleic Acids Research 2013

at National Institutes of H

ealth Library on D

ecember 12 2013

httpnaroxfordjournalsorgD

ownloaded from

Figure 2 Bioactivity data for a (A) protein target and (B) gene target

Nucleic Acids Research 2013 5

at National Institutes of H

ealth Library on D

ecember 12 2013

httpnaroxfordjournalsorgD

ownloaded from

be lacking or incomplete among many datasets making itdifficult for users to discover these key associationsTo improve this situation it is now a common practice to

create a lsquoSummaryrsquo bioassay at the outset of a multi-assayproject and then link each subsequent-related assay back tothat summary record This means that the submitter onlyneeds to specify a single link for each bioassay record to thesame summary and all other links between related assaysare automatically generated As a result assay projects areindexed on top of the individual records Users visiting anybioassay record can access all relevant datasets of the sameproject without the need for the submitter to specify allconnections As shown in Figure 3 the links to theserelated bioassays are labeled in the BioAssay Summaryservice as lsquoSame Projectrsquo under the lsquoRelated BioAssaysrsquosection The Modulation of the Metabotropic GlutamateReceptor mGluR3 (GRM3) assay (httppubchemncbinlmnihgovassayassaycgiaid=651839) indicates onlyone lsquoDepositor Specifiedrsquo assay whereas eight bioassayrecords were identified as related to the same project bythe new procedure One may see details of the related bio-assays by clicking the link lsquoSame Projectrsquo

PUBLIC ACCESS

BioAssay record and BioAssay summary service

A PubChem BioAssay record can be accessed via theBioAssay Summary service at httppubchemncbinlmnihgovassayassaycgi where myAID is a validBioAssay accession (AID) As shown in Figure 3 for theGRM3 assay (AID 651839) the BioAssay Summaryservice provides (i) full access to submitted informationincluding bioassay protocol descriptions assay dataand cross-references (ii) derived bioassay relationshipsand (iii) tools for evaluating tested compounds studyingSAR or researching target For the lsquoTargetrsquo section alink lsquoMore Bioactivity datarsquo has been recently addedto gather all bioactivity data in PubChem associatedwith the GRM3 target The BioAssay Summary servicenow provides instant access to bioassay data table andenhanced function for data download with improveddatabase infrastructure With the recently launchedPubChem Social Media outreach links to social mediaaccounts are now provided on this page

Figure 3 BioAssay Summary page for bioassay record AID 651839 New and enhanced features are highlighted including fast download instantaccess to data table link to additional bioactivity data targeting GRM3 link to related bioassays on the same project and links to social mediaaccount

6 Nucleic Acids Research 2013

at National Institutes of H

ealth Library on D

ecember 12 2013

httpnaroxfordjournalsorgD

ownloaded from

BioAssay search

Keyword search in the PubChem BioAssay database issupported by NCBI Entrez at httpwwwncbinlmnihgovpcassay Textual information in PubChemBioAssay is indexed under numerous fields Anadvanced interface is provided at httpwwwncbinlmnihgovpcassaylimits (Limits page) to access multipleindices and filters (1) Based on information provided incategorized comment fields and keywords in the title of abioassay record new filters were added to support theidentification of records containing (i) biochemical assay(ii) cell-based assay (iii) proteinndashprotein interaction bio-activity and (iv) in vivo or in vitro assay A newly addedmenu lsquoAssay Projectrsquo can be used to select an assay projectand accessing related datasets ChEMBL depositor infor-mation is also indexed to support sub-setting ChEMBLrecords As a result although httpwwwncbinlmnihgovpcassayterm=ChEMBL[sourcename] retrieves allChEMBL bioassays in PubChem httpwwwncbinlmnihgovpcassayterm=22ChEMBL3A3AScientific+Literature225BSourceName5D[SourceName] re-trieves literature-based records from ChEMBL andhttpwwwncbinlmnihgovpcassayterm=22ChEMBL3A3ASt+Jude+Malaria+Screening225BSourceName5D[SourceName] retrieves ChEMBL records de-posited by St Jude Malaria Screening

PubChem BioAssay FTP AND DOWNLOAD

PubChem provides multiple services for users to download bioassay records, which have been described previously (1). These primarily include (i) an enhanced download function at the Summary service (shown in Figure 3); (ii) a web-based BioAssay download service at http://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi, with a flexible interface supporting full or partial data download by specifying bioassay accessions (AIDs) and tested substance accessions (SIDs); and (iii) the daily updated PubChem BioAssay FTP at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay, providing open access to all bioassay datasets. While the primary FTP structure remains the same, one new FTP directory, ‘Extras’, has been added to offer additional information on the BioAssay resource. In this folder, the file ‘Cid2BioactivityLink’ provides a list of tested compounds and the corresponding URLs linking to associated bioactivity data. Similarly, the ‘Gi2BioactivityLink’ and ‘Geneid2BioactivityLink’ files provide the lists of corresponding bioactivity data links for protein and gene targets, respectively. The ‘Aid2GiGeneid’ file contains all the bioassay (AID), protein target (GI) and gene target (Gene ID) associations in the BioAssay database. Also, a file of assay project-based related bioassays has been added to the directory at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/AssayNeighbors. Column headers for the comma-separated values (CSV) format have been modified to provide consistency among multiple download methods (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/CSV/README). Readout names are now provided in CSV files to ease data parsing and interpretation. In addition, PubChem PUG SOAP (http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html) and PUG REST (http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html) facilities are being developed to support programmatic retrieval of bioassay information.
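To illustrate how a mapping file such as ‘Aid2GiGeneid’ might be consumed once downloaded, the sketch below groups target identifiers by AID. The file layout is an assumption made for illustration only (tab-delimited text with three columns in the order AID, GI, Gene ID, and an optional header row); the actual layout should be checked against the file on the FTP site. The function name and the sample rows are hypothetical placeholders, not real PubChem data.

```python
import csv
import io
from collections import defaultdict


def load_aid_targets(handle):
    """Group protein (GI) and gene (Gene ID) targets by bioassay AID.

    ASSUMED layout (not confirmed by the source): tab-delimited rows of
    AID, GI, Gene ID, where GI or Gene ID may be empty.
    """
    targets = defaultdict(lambda: {"gi": set(), "geneid": set()})
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and any header row (first field not numeric).
        if not row or not row[0].strip().isdigit():
            continue
        aid = int(row[0])
        gi = row[1].strip() if len(row) > 1 else ""
        geneid = row[2].strip() if len(row) > 2 else ""
        if gi:
            targets[aid]["gi"].add(int(gi))
        if geneid:
            targets[aid]["geneid"].add(int(geneid))
    return dict(targets)


if __name__ == "__main__":
    # Placeholder rows for demonstration; identifiers are invented.
    sample = io.StringIO("AID\tGI\tGeneID\n1\t100\t200\n1\t101\t\n2\t\t300\n")
    for aid, found in sorted(load_aid_targets(sample).items()):
        print(aid, sorted(found["gi"]), sorted(found["geneid"]))
```

Reading from a file-like handle rather than a path keeps the parser independent of how the file was obtained, whether by FTP client, browser download or a local mirror.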

PubChem UPLOAD FOR BioAssay SUBMISSION

As a public repository handling diverse and vast amounts of chemical structure and bioassay data, it is critical for PubChem to provide an efficient and user-friendly way to upload data. The recently released PubChem Upload (http://pubchem.ncbi.nlm.nih.gov/upload) makes use of advances in web technologies to offer streamlined support for data submissions and updates to the Substance and BioAssay databases. PubChem Upload supports all functionalities and data exchange formats of its predecessor (1). Furthermore, it provides an extensive set of wizards, inline help tips and tutorials for guiding submitters to enter assay data and descriptive information. More specifically, the new assay submission capabilities offered by PubChem Upload include (i) bioassay submission wizards to assist novice users for both small-molecule and RNAi screenings; (ii) improved user interface response to complex input with newer web technology; (iii) simplified new user registration and upgrades for production user accounts; (iv) improved help, including hints built into the user interface and a tutorial; (v) extensive PubChem bioassay templates for new submissions or for record updates; (vi) full editing and integration of assay data and description tables; and (vii) expanded import/export handling of spreadsheets for assays. A detailed help document, tutorial and sample submission templates for PubChem Upload are available at http://pubchem.ncbi.nlm.nih.gov/upload/docs/upload_help.html, http://pubchem.ncbi.nlm.nih.gov/upload/tutorial and http://pubchem.ncbi.nlm.nih.gov/upload/docs/upload_help.html#AssaySubmission, respectively. A detailed description of PubChem Upload will be provided in a separate article.

SUMMARY

PubChem is committed to serving as a public repository for bioactivity data of small molecules and RNAi. PubChem also provides an integrated information platform with a suite of tools allowing users to query, analyze and download all database content. PubChem will continue to improve services and tools as technology advances, and to further integrate the information it contains with third-party annotations and other public biomedical data. With the support of open access to the data and the delivery of the new Upload system, PubChem welcomes the community to use the resource and to contribute data content to the repository.

ACKNOWLEDGEMENTS

The authors thank all submitters who have contributed data to PubChem and the rest of the PubChem team for their support.

Nucleic Acids Research 2013 7


FUNDING

The NIH Intramural Research program. Funding for open access charge: National Institutes of Health, USA.

Conflict of interest statement. None declared.

REFERENCES

1. Wang,Y., Xiao,J., Suzek,T.O., Zhang,J., Wang,J., Zhou,Z., Han,L., Karapetyan,K., Dracheva,S., Shoemaker,B.A. et al. (2012) PubChem’s BioAssay database. Nucleic Acids Res., 40, D400–D412.

2. Wang,Y., Bolton,E., Dracheva,S., Karapetyan,K., Shoemaker,B.A., Suzek,T.O., Wang,J., Xiao,J., Zhang,J. and Bryant,S.H. (2010) An overview of the PubChem BioAssay resource. Nucleic Acids Res., 38, D255–D266.

3. Wang,Y., Xiao,J., Suzek,T.O., Zhang,J., Wang,J. and Bryant,S.H. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res., 37, W623–W633.

4. Bolton,E.E., Wang,Y., Thiessen,P.A. and Bryant,S.H. (2008) PubChem: integrated platform of small molecules and biological activities. Annu. Rep. Comput. Chem., 4, 217–241.

5. Sayers,E.W., Barrett,T., Benson,D.A., Bolton,E., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., DiCuccio,M., Federhen,S. et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 39, D38–D51.

6. Butkiewicz,M., Lowe,E.W. Jr, Mueller,R., Mendenhall,J.L., Teixeira,P.L., Weaver,C.D. and Meiler,J. (2013) Benchmarking ligand-based virtual high-throughput screening with the PubChem database. Molecules, 18, 735–756.

7. Sharman,J.L., Benson,H.E., Pawson,A.J., Lukito,V., Mpamhanga,C.P., Bombail,V., Davenport,A.P., Peters,J.A., Spedding,M. and Harmar,A.J. (2013) IUPHAR-DB: updated database content and new features. Nucleic Acids Res., 41, D1083–D1088.

8. Gaulton,A., Bellis,L.J., Bento,A.P., Chambers,J., Davies,M., Hersey,A., Light,Y., McGlinchey,S., Michalovich,D., Al-Lazikani,B. et al. (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res., 40, D1100–D1107.

9. Mulder,K.W., Wang,X., Escriu,C., Ito,Y., Schwarz,R.F., Gillis,J., Sirokmany,G., Donati,G., Uribe-Lewis,S., Pavlidis,P. et al. (2012) Diverse epigenetic strategies interact to control epidermal differentiation. Nat. Cell Biol., 14, 753–763.

10. Chih,B., Liu,P., Chinn,Y., Chalouni,C., Komuves,L.G., Hass,P.E., Sandoval,W. and Peterson,A.S. (2012) A ciliopathy complex at the transition zone protects the cilia as a privileged membrane domain. Nat. Cell Biol., 14, 61–72.

11. Prager-Khoutorsky,M., Lichtenstein,A., Krishnan,R., Rajendran,K., Mayo,A., Kam,Z., Geiger,B. and Bershadsky,A.D. (2011) Fibroblast polarization is a matrix-rigidity-dependent process controlled by focal adhesion mechanosensing. Nat. Cell Biol., 13, 1457–1465.

12. Imberg-Kazdan,K., Ha,S., Greenfield,A., Poultney,C.S., Bonneau,R., Logan,S.K. and Garabedian,M.J. (2013) A genome-wide RNA interference screen identifies new regulators of androgen receptor function in prostate cancer cells. Genome Res., 23, 581–591.

13. Powell,M.L., Smith,J.A., Sowa,M.E., Harper,J.W., Iftner,T., Stubenrauch,F. and Howley,P.M. (2010) NCoR1 mediates papillomavirus E8^E2C transcriptional repression. J. Virol., 84, 4451–4460.

14. Galluzzi,L., Morselli,E., Vitale,I., Kepp,O., Senovilla,L., Criollo,A., Servant,N., Paccard,C., Hupe,P., Robert,T. et al. (2010) miR-181a and miR-630 regulate cisplatin-induced cancer cell death. Cancer Res., 70, 1793–1803.

15. Smith,J.A., White,E.A., Sowa,M.E., Powell,M.L., Ottinger,M., Harper,J.W. and Howley,P.M. (2010) Genome-wide siRNA screen identifies SMCX, EP400, and Brd4 as E2-dependent regulators of human papillomavirus oncogene expression. Proc. Natl Acad. Sci. USA, 107, 3752–3757.

16. Zhang,S.L., Yeromin,A.V., Zhang,X.H., Yu,Y., Safrina,O., Penna,A., Roos,J., Stauderman,K.A. and Cahalan,M.D. (2006) Genome-wide RNAi screen of Ca(2+) influx identifies genes that regulate Ca(2+) release-activated Ca(2+) channel activity. Proc. Natl Acad. Sci. USA, 103, 9357–9362.

17. Friedman,A. and Perrimon,N. (2006) A functional RNAi screen for regulators of receptor tyrosine kinase and ERK signalling. Nature, 444, 230–234.

18. Gwack,Y., Sharma,S., Nardone,J., Tanasa,B., Iuga,A., Srikanth,S., Okamura,H., Bolton,D., Feske,S., Hogan,P.G. et al. (2006) A genome-wide Drosophila RNAi screen identifies DYRK-family kinases as regulators of NFAT. Nature, 441, 646–650.

19. Bard,F., Casano,L., Mallabiabarrena,A., Wallace,E., Saito,K., Kitayama,H., Guizzunti,G., Hu,Y., Wendler,F., Dasgupta,R. et al. (2006) Functional genomics reveals genes involved in protein secretion and Golgi organization. Nature, 439, 604–607.

20. Vig,M., Peinelt,C., Beck,A., Koomoa,D.L., Rabah,D., Koblan-Huberson,M., Kraft,S., Turner,H., Fleig,A., Penner,R. et al. (2006) CRACM1 is a plasma membrane protein essential for store-operated Ca2+ entry. Science, 312, 1220–1223.

21. DasGupta,R., Kaykas,A., Moon,R.T. and Perrimon,N. (2005) Functional genomic analysis of the Wnt-wingless signaling pathway. Science, 308, 826–833.

22. Nybakken,K., Vokes,S.A., Lin,T.Y., McMahon,A.P. and Perrimon,N. (2005) A genome-wide RNA interference screen in Drosophila melanogaster cells for new components of the Hh signaling pathway. Nat. Genet., 37, 1323–1332.


Figure 2. Bioactivity data for a (A) protein target and (B) gene target.


be lacking or incomplete among many datasets, making it difficult for users to discover these key associations. To improve this situation, it is now a common practice to create a ‘Summary’ bioassay at the outset of a multi-assay project and then link each subsequent related assay back to that summary record. This means that the submitter only needs to specify a single link for each bioassay record to the same summary, and all other links between related assays are automatically generated. As a result, assay projects are indexed on top of the individual records. Users visiting any bioassay record can access all relevant datasets of the same project without the need for the submitter to specify all connections. As shown in Figure 3, the links to these related bioassays are labeled in the BioAssay Summary service as ‘Same Project’ under the ‘Related BioAssays’ section. The Modulation of the Metabotropic Glutamate Receptor mGluR3 (GRM3) assay (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=651839) indicates only one ‘Depositor Specified’ assay, whereas eight bioassay records were identified as related to the same project by the new procedure. One may see details of the related bioassays by clicking the link ‘Same Project’.
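The idea of deriving all ‘Same Project’ links from one depositor-specified link per record can be sketched in a few lines. This is an illustration of the principle only, not PubChem’s actual indexing procedure; the function name and the AIDs in the usage example are hypothetical placeholders.

```python
from collections import defaultdict


def same_project(links):
    """Derive related-assay lists from one link per assay.

    `links` maps each assay AID to the summary AID it points to.
    Returns a dict mapping every AID in a project (summary included)
    to the sorted list of the other AIDs in that project.
    """
    # Collect the members of each project, keyed by the summary AID.
    projects = defaultdict(set)
    for aid, summary_aid in links.items():
        projects[summary_aid].update((aid, summary_aid))

    # Every member is related to all other members of its project.
    related = {}
    for members in projects.values():
        for aid in members:
            related[aid] = sorted(members - {aid})
    return related


if __name__ == "__main__":
    # Placeholder AIDs: two assays point to summary 100, one to summary 200.
    depositor_links = {101: 100, 102: 100, 201: 200}
    for aid, others in sorted(same_project(depositor_links).items()):
        print(aid, "->", others)
```

With n assays in a project, the depositor supplies n links while the derived index exposes every pairwise relationship, which is why a record showing a single ‘Depositor Specified’ link can still list many ‘Same Project’ assays.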

PUBLIC ACCESS

BioAssay record and BioAssay summary service

A PubChem BioAssay record can be accessed via the BioAssay Summary service at http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=myAID, where myAID is a valid BioAssay accession (AID). As shown in Figure 3 for the GRM3 assay (AID 651839), the BioAssay Summary service provides (i) full access to submitted information, including bioassay protocol descriptions, assay data and cross-references; (ii) derived bioassay relationships; and (iii) tools for evaluating tested compounds, studying SAR or researching the target. For the ‘Target’ section, a link ‘More Bioactivity data’ has recently been added to gather all bioactivity data in PubChem associated with the GRM3 target. The BioAssay Summary service now provides instant access to the bioassay data table and an enhanced function for data download, with improved database infrastructure. With the recently launched PubChem Social Media outreach, links to social media accounts are now provided on this page.
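As a small sketch of the access pattern described above, a Summary page address can be composed from an AID by following the aid= query form used for the GRM3 example. The helper name `summary_url` is hypothetical, and the check that an AID is a positive integer is an assumption of this sketch rather than a documented rule of the service.

```python
# BioAssay Summary service address, as given in the text.
SUMMARY_SERVICE = "http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi"


def summary_url(aid: int) -> str:
    """Return the BioAssay Summary URL for one AID (hypothetical helper)."""
    # Assumption: AIDs are positive integers; reject anything else early
    # so that a malformed URL is never produced.
    if isinstance(aid, bool) or not isinstance(aid, int) or aid <= 0:
        raise ValueError(f"AID must be a positive integer, got {aid!r}")
    return f"{SUMMARY_SERVICE}?aid={aid}"


if __name__ == "__main__":
    # AID 651839 is the GRM3 assay used as the example in the text.
    print(summary_url(651839))
```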

Figure 3. BioAssay Summary page for bioassay record AID 651839. New and enhanced features are highlighted, including fast download, instant access to data table, link to additional bioactivity data targeting GRM3, link to related bioassays on the same project and links to social media accounts.


BioAssay search

Keyword search in the PubChem BioAssay database issupported by NCBI Entrez at httpwwwncbinlmnihgovpcassay Textual information in PubChemBioAssay is indexed under numerous fields Anadvanced interface is provided at httpwwwncbinlmnihgovpcassaylimits (Limits page) to access multipleindices and filters (1) Based on information provided incategorized comment fields and keywords in the title of abioassay record new filters were added to support theidentification of records containing (i) biochemical assay(ii) cell-based assay (iii) proteinndashprotein interaction bio-activity and (iv) in vivo or in vitro assay A newly addedmenu lsquoAssay Projectrsquo can be used to select an assay projectand accessing related datasets ChEMBL depositor infor-mation is also indexed to support sub-setting ChEMBLrecords As a result although httpwwwncbinlmnihgovpcassayterm=ChEMBL[sourcename] retrieves allChEMBL bioassays in PubChem httpwwwncbinlmnihgovpcassayterm=22ChEMBL3A3AScientific+Literature225BSourceName5D[SourceName] re-trieves literature-based records from ChEMBL andhttpwwwncbinlmnihgovpcassayterm=22ChEMBL3A3ASt+Jude+Malaria+Screening225BSourceName5D[SourceName] retrieves ChEMBL records de-posited by St Jude Malaria Screening

PubChem BioAssay FTP AND DOWNLOAD

PubChem provides multiple services for users todownload bioassay records which have been describedpreviously (1) This primarily includes (i) an enhanceddownload function at the Summary service (shown inFigure 3) (ii) a web-based BioAssay download serviceat httppubchemncbinlmnihgovassayassaydownloadcgi with a flexible interface supporting full or partial datadownload by specifying bioassay accessions (AIDs) andtested substance accessions (SIDs) and (iii) daily updatedPubChem BioAssay FTP at ftpftpncbinlmnihgovpubchemBioassay providing open access to all bioassaydatasets While the primary FTP structure remains thesame one new FTP directory lsquoExtrasrsquo is added to offeradditional information of the BioAssay resource In thisfolder the file lsquoCid2BioactivityLinkrsquo provides a list oftested compounds and the corresponding URLs linkingto associated bioactivity data Similarly thelsquoGi2BioactivityLinkrsquo and lsquoGeneid2BioactivityLinkrsquo filesprovide the list of the corresponding bioactivity datalinks for protein and gene targets respectively ThelsquoAid2GiGeneidrsquo contains all the bioassay (AID) pro-tein target (GI) and gene target (Gene ID) associationsin the BioAssay database Also a file for assayproject-based related bioassays is added to the directoryat ftpftpncbinlmnihgovpubchemBioassayAssayNeighbors Column headers for the comma-separatedvalues (CSV) format has been modified to provide con-sistency among multiple download methods (ftpftpncbinlmnihgovpubchemBioassayCSVREADME)Readout names are now provided in CSV files to ease dataparsing and interpretation In addition PubChem PUG

SOAP (httppubchemncbinlmnihgovpugpughelphtml) and PUGREST (httppubchemncbinlmnihgovpug_restPUG_RESThtml) facilities are being de-veloped to support programmatic retrieval of bioassayinformation

PubChem UPLOAD FOR BioAssay SUBMISSION

As a public repository handling diverse and vast amountsof chemical structure and bioassay data it is critical forPubChem to provide an efficient and user-friendly way toupload data The recently released PubChem Upload(httppubchemncbinlmnihgovupload) makes use ofadvances in web technologies to offer streamlinedsupport for data submissions and updates to theSubstance and BioAssay databases PubChem Uploadsupports all functionalities and data exchange formats ofits predecessor (1) Furthermore it provides an extensiveset of wizards inline help tips and tutorials for guidingsubmitters to enter assay data and descriptive informa-tion More specifically the new assay submissioncapabilities offered by PubChem Upload include (i)bioassay submission wizards to assist novice users forboth small molecule and RNAi screenings (ii) improveduser interface response to complex input with newer webtechnology (iii) simplified new user registration upgradesfor production user accounts (iv) improved helpincluding hints built into user interface and tutorial (v)extensive PubChem bioassay templates for new submis-sions or for record updates (vi) full editing and integra-tion of assay data and description tables and (vii)expanded importexport handling of spreadsheets forassays A detailed help document tutorial and samplesubmission templates for PubChem Upload are availableat httppubchemncbinlmnihgovuploaddocsupload_helphtml httppubchemncbinlmnihgovuploadtutorial and httppubchemncbinlmnihgovuploaddocsupload_helphtmlAssaySubmission respectively Adetailed description of PubChem Upload will be providedin a separate article

SUMMARY

PubChem is committed to serve as a public repository forbioactivity data of small molecules and RNAi PubChemalso provides an integrated information platform with asuite of tools allowing users to query analyze anddownload all database content PubChem will continueto improve services and tools as technology advancesand to further integrate the information it contains tothird party annotations and other public biomedicaldata With the support of open access to the data andthe delivery of the new Upload system PubChemwelcomes the community to use the resource and to con-tribute data content to the repository

ACKNOWLEDGEMENTS

The authors thank all submitters who have contributeddata to PubChem and the rest of the PubChem team fortheir support

Nucleic Acids Research 2013 7

at National Institutes of H

ealth Library on D

ecember 12 2013

httpnaroxfordjournalsorgD

ownloaded from

FUNDING

The NIH Intramural Research program Funding foropen access charge National Insitutes of Health USA

Conflict of interest statement None declared

REFERENCES

1 WangY XiaoJ SuzekTO ZhangJ WangJ ZhouZHanL KarapetyanK DrachevaS ShoemakerBA et al(2012) PubChemrsquos BioAssay database Nucleic Acids Res 40D400ndashD412

2 WangY BoltonE DrachevaS KarapetyanKShoemakerBA SuzekTO WangJ XiaoJ ZhangJ andBryantSH (2010) An overview of the PubChem BioAssayresource Nucleic Acids Res 38 D255ndashD266

3 WangY XiaoJ SuzekTO ZhangJ WangJ and BryantSH(2009) PubChem a public information system for analyzingbioactivities of small molecules Nucleic Acids Res 37W623ndashW633

4 BoltonEE WangY ThiessenPA and BryantSH (2008)PubChem integrated platform of small molecules and biologicalactivities Annu Rep Comput Chem 4 217ndash241

5 SayersEW BarrettT BensonDA BoltonE BryantSHCaneseK ChetverninV ChurchDM DiCuccioMFederhenS et al (2011) Database resources of the NationalCenter for Biotechnology Information Nucleic Acids Res 39D38ndashD51

6 ButkiewiczM LoweEW Jr MuellerR MendenhallJLTeixeiraPL WeaverCD and MeilerJ (2013) Benchmarkingligand-based virtual high-throughput screening with the PubChemdatabase Molecules 18 735ndash756

7 SharmanJL BensonHE PawsonAJ LukitoVMpamhangaCP BombailV DavenportAP PetersJASpeddingM and HarmarAJ (2013) IUPHAR-DB updateddatabase content and new features Nucleic Acids Res 41D1083ndashD1088

8 GaultonA BellisLJ BentoAP ChambersJ DaviesMHerseyA LightY McGlincheyS MichalovichD Al-LazikaniB et al (2012) ChEMBL a large-scale bioactivitydatabase for drug discovery Nucleic Acids Res 40D1100ndashD1107

9 MulderKW WangX EscriuC ItoY SchwarzRF GillisJSirokmanyG DonatiG Uribe-LewisS PavlidisP et al (2012)Diverse epigenetic strategies interact to control epidermaldifferentiation Nat Cell Biol 14 753ndash763

10 ChihB LiuP ChinnY ChalouniC KomuvesLG HassPESandovalW and PetersonAS (2012) A ciliopathy complex atthe transition zone protects the cilia as a privileged membranedomain Nat Cell Biol 14 61ndash72

11 Prager-KhoutorskyM LichtensteinA KrishnanRRajendranK MayoA KamZ GeigerB and BershadskyAD(2011) Fibroblast polarization is a matrix-rigidity-dependentprocess controlled by focal adhesion mechanosensing Nat CellBiol 13 1457ndash1465

12 Imberg-KazdanK HaS GreenfieldA PoultneyCSBonneauR LoganSK and GarabedianMJ (2013) A genome-wide RNA interference screen identifies new regulators ofandrogen receptor function in prostate cancer cells Genome Res23 581ndash591

13 PowellML SmithJA SowaME HarperJW IftnerTStubenrauchF and HowleyPM (2010) NCoR1 mediatespapillomavirus E8E2C transcriptional repression J Virol 844451ndash4460

14 GalluzziL MorselliE VitaleI KeppO SenovillaLCriolloA ServantN PaccardC HupeP RobertT et al(2010) miR-181a and miR-630 regulate cisplatin-induced cancercell death Cancer Res 70 1793ndash1803

15 SmithJA WhiteEA SowaME PowellML OttingerMHarperJW and HowleyPM (2010) Genome-wide siRNA screenidentifies SMCX EP400 and Brd4 as E2-dependent regulators ofhuman papillomavirus oncogene expression Proc Natl Acad SciUSA 107 3752ndash3757

16 ZhangSL YerominAV ZhangXH YuY SafrinaOPennaA RoosJ StaudermanKA and CahalanMD (2006)Genome-wide RNAi screen of Ca(2+) influx identifies genes thatregulate Ca(2+) release-activated Ca(2+) channel activity ProcNatl Acad Sci USA 103 9357ndash9362

17 FriedmanA and PerrimonN (2006) A functional RNAi screenfor regulators of receptor tyrosine kinase and ERK signallingNature 444 230ndash234

18 GwackY SharmaS NardoneJ TanasaB IugaASrikanthS OkamuraH BoltonD FeskeS HoganPG et al(2006) A genome-wide Drosophila RNAi screen identifies DYRK-family kinases as regulators of NFAT Nature 441 646ndash650

19 BardF CasanoL MallabiabarrenaA WallaceE SaitoKKitayamaH GuizzuntiG HuY WendlerF DasguptaRet al (2006) Functional genomics reveals genes involved inprotein secretion and Golgi organization Nature 439 604ndash607

20 VigM PeineltC BeckA KoomoaDL RabahD Koblan-HubersonM KraftS TurnerH FleigA PennerR et al(2006) CRACM1 is a plasma membrane protein essential forstore-operated Ca2+ entry Science 312 1220ndash1223

21 DasGuptaR KaykasA MoonRT and PerrimonN (2005)Functional genomic analysis of the Wnt-wingless signalingpathway Science 308 826ndash833

22 NybakkenK VokesSA LinTY McMahonAP andPerrimonN (2005) A genome-wide RNA interference screen inDrosophila melanogaster cells for new components of the Hhsignaling pathway Nat Genet 37 1323ndash1332

8 Nucleic Acids Research 2013

at National Institutes of H

ealth Library on D

ecember 12 2013

httpnaroxfordjournalsorgD

ownloaded from

be lacking or incomplete among many datasets making itdifficult for users to discover these key associationsTo improve this situation it is now a common practice to

create a lsquoSummaryrsquo bioassay at the outset of a multi-assayproject and then link each subsequent-related assay back tothat summary record This means that the submitter onlyneeds to specify a single link for each bioassay record to thesame summary and all other links between related assaysare automatically generated As a result assay projects areindexed on top of the individual records Users visiting anybioassay record can access all relevant datasets of the sameproject without the need for the submitter to specify allconnections As shown in Figure 3 the links to theserelated bioassays are labeled in the BioAssay Summaryservice as lsquoSame Projectrsquo under the lsquoRelated BioAssaysrsquosection The Modulation of the Metabotropic GlutamateReceptor mGluR3 (GRM3) assay (httppubchemncbinlmnihgovassayassaycgiaid=651839) indicates onlyone lsquoDepositor Specifiedrsquo assay whereas eight bioassayrecords were identified as related to the same project bythe new procedure One may see details of the related bio-assays by clicking the link lsquoSame Projectrsquo

PUBLIC ACCESS

BioAssay record and BioAssay summary service

A PubChem BioAssay record can be accessed via theBioAssay Summary service at httppubchemncbinlmnihgovassayassaycgi where myAID is a validBioAssay accession (AID) As shown in Figure 3 for theGRM3 assay (AID 651839) the BioAssay Summaryservice provides (i) full access to submitted informationincluding bioassay protocol descriptions assay dataand cross-references (ii) derived bioassay relationshipsand (iii) tools for evaluating tested compounds studyingSAR or researching target For the lsquoTargetrsquo section alink lsquoMore Bioactivity datarsquo has been recently addedto gather all bioactivity data in PubChem associatedwith the GRM3 target The BioAssay Summary servicenow provides instant access to bioassay data table andenhanced function for data download with improveddatabase infrastructure With the recently launchedPubChem Social Media outreach links to social mediaaccounts are now provided on this page

Figure 3 BioAssay Summary page for bioassay record AID 651839 New and enhanced features are highlighted including fast download instantaccess to data table link to additional bioactivity data targeting GRM3 link to related bioassays on the same project and links to social mediaaccount

6 Nucleic Acids Research 2013

at National Institutes of H

ealth Library on D

ecember 12 2013

httpnaroxfordjournalsorgD

ownloaded from

BioAssay search

Keyword search in the PubChem BioAssay database issupported by NCBI Entrez at httpwwwncbinlmnihgovpcassay Textual information in PubChemBioAssay is indexed under numerous fields Anadvanced interface is provided at httpwwwncbinlmnihgovpcassaylimits (Limits page) to access multipleindices and filters (1) Based on information provided incategorized comment fields and keywords in the title of abioassay record new filters were added to support theidentification of records containing (i) biochemical assay(ii) cell-based assay (iii) proteinndashprotein interaction bio-activity and (iv) in vivo or in vitro assay A newly addedmenu lsquoAssay Projectrsquo can be used to select an assay projectand accessing related datasets ChEMBL depositor infor-mation is also indexed to support sub-setting ChEMBLrecords As a result although httpwwwncbinlmnihgovpcassayterm=ChEMBL[sourcename] retrieves allChEMBL bioassays in PubChem httpwwwncbinlmnihgovpcassayterm=22ChEMBL3A3AScientific+Literature225BSourceName5D[SourceName] re-trieves literature-based records from ChEMBL andhttpwwwncbinlmnihgovpcassayterm=22ChEMBL3A3ASt+Jude+Malaria+Screening225BSourceName5D[SourceName] retrieves ChEMBL records de-posited by St Jude Malaria Screening

PubChem BioAssay FTP AND DOWNLOAD

PubChem provides multiple services for users todownload bioassay records which have been describedpreviously (1) This primarily includes (i) an enhanceddownload function at the Summary service (shown inFigure 3) (ii) a web-based BioAssay download serviceat httppubchemncbinlmnihgovassayassaydownloadcgi with a flexible interface supporting full or partial datadownload by specifying bioassay accessions (AIDs) andtested substance accessions (SIDs) and (iii) daily updatedPubChem BioAssay FTP at ftpftpncbinlmnihgovpubchemBioassay providing open access to all bioassaydatasets While the primary FTP structure remains thesame one new FTP directory lsquoExtrasrsquo is added to offeradditional information of the BioAssay resource In thisfolder the file lsquoCid2BioactivityLinkrsquo provides a list oftested compounds and the corresponding URLs linkingto associated bioactivity data Similarly thelsquoGi2BioactivityLinkrsquo and lsquoGeneid2BioactivityLinkrsquo filesprovide the list of the corresponding bioactivity datalinks for protein and gene targets respectively ThelsquoAid2GiGeneidrsquo contains all the bioassay (AID) pro-tein target (GI) and gene target (Gene ID) associationsin the BioAssay database Also a file for assayproject-based related bioassays is added to the directoryat ftpftpncbinlmnihgovpubchemBioassayAssayNeighbors Column headers for the comma-separatedvalues (CSV) format has been modified to provide con-sistency among multiple download methods (ftpftpncbinlmnihgovpubchemBioassayCSVREADME)Readout names are now provided in CSV files to ease dataparsing and interpretation In addition PubChem PUG

SOAP (httppubchemncbinlmnihgovpugpughelphtml) and PUGREST (httppubchemncbinlmnihgovpug_restPUG_RESThtml) facilities are being de-veloped to support programmatic retrieval of bioassayinformation

PubChem UPLOAD FOR BioAssay SUBMISSION

As a public repository handling diverse and vast amountsof chemical structure and bioassay data it is critical forPubChem to provide an efficient and user-friendly way toupload data The recently released PubChem Upload(httppubchemncbinlmnihgovupload) makes use ofadvances in web technologies to offer streamlinedsupport for data submissions and updates to theSubstance and BioAssay databases PubChem Uploadsupports all functionalities and data exchange formats ofits predecessor (1) Furthermore it provides an extensiveset of wizards inline help tips and tutorials for guidingsubmitters to enter assay data and descriptive informa-tion More specifically the new assay submissioncapabilities offered by PubChem Upload include (i)bioassay submission wizards to assist novice users forboth small molecule and RNAi screenings (ii) improveduser interface response to complex input with newer webtechnology (iii) simplified new user registration upgradesfor production user accounts (iv) improved helpincluding hints built into user interface and tutorial (v)extensive PubChem bioassay templates for new submis-sions or for record updates (vi) full editing and integra-tion of assay data and description tables and (vii)expanded importexport handling of spreadsheets forassays A detailed help document tutorial and samplesubmission templates for PubChem Upload are availableat httppubchemncbinlmnihgovuploaddocsupload_helphtml httppubchemncbinlmnihgovuploadtutorial and httppubchemncbinlmnihgovuploaddocsupload_helphtmlAssaySubmission respectively Adetailed description of PubChem Upload will be providedin a separate article

SUMMARY

PubChem is committed to serve as a public repository forbioactivity data of small molecules and RNAi PubChemalso provides an integrated information platform with asuite of tools allowing users to query analyze anddownload all database content PubChem will continueto improve services and tools as technology advancesand to further integrate the information it contains tothird party annotations and other public biomedicaldata With the support of open access to the data andthe delivery of the new Upload system PubChemwelcomes the community to use the resource and to con-tribute data content to the repository

ACKNOWLEDGEMENTS

The authors thank all submitters who have contributeddata to PubChem and the rest of the PubChem team fortheir support

Nucleic Acids Research 2013 7

at National Institutes of H

ealth Library on D

ecember 12 2013

httpnaroxfordjournalsorgD

ownloaded from

FUNDING

The NIH Intramural Research program Funding foropen access charge National Insitutes of Health USA

Conflict of interest statement None declared

REFERENCES

1 WangY XiaoJ SuzekTO ZhangJ WangJ ZhouZHanL KarapetyanK DrachevaS ShoemakerBA et al(2012) PubChemrsquos BioAssay database Nucleic Acids Res 40D400ndashD412

2 WangY BoltonE DrachevaS KarapetyanKShoemakerBA SuzekTO WangJ XiaoJ ZhangJ andBryantSH (2010) An overview of the PubChem BioAssayresource Nucleic Acids Res 38 D255ndashD266

3 WangY XiaoJ SuzekTO ZhangJ WangJ and BryantSH(2009) PubChem a public information system for analyzingbioactivities of small molecules Nucleic Acids Res 37W623ndashW633

4 BoltonEE WangY ThiessenPA and BryantSH (2008)PubChem integrated platform of small molecules and biologicalactivities Annu Rep Comput Chem 4 217ndash241

5 SayersEW BarrettT BensonDA BoltonE BryantSHCaneseK ChetverninV ChurchDM DiCuccioMFederhenS et al (2011) Database resources of the NationalCenter for Biotechnology Information Nucleic Acids Res 39D38ndashD51

6 ButkiewiczM LoweEW Jr MuellerR MendenhallJLTeixeiraPL WeaverCD and MeilerJ (2013) Benchmarkingligand-based virtual high-throughput screening with the PubChemdatabase Molecules 18 735ndash756


8 Nucleic Acids Research 2013


BioAssay search

Keyword search in the PubChem BioAssay database is supported by NCBI Entrez at http://www.ncbi.nlm.nih.gov/pcassay. Textual information in PubChem BioAssay is indexed under numerous fields. An advanced interface is provided at http://www.ncbi.nlm.nih.gov/pcassay/limits (Limits page) to access multiple indices and filters (1). Based on information provided in categorized comment fields and keywords in the title of a bioassay record, new filters were added to support the identification of records containing (i) biochemical assay, (ii) cell-based assay, (iii) protein–protein interaction bioactivity and (iv) in vivo or in vitro assay. A newly added menu, ‘Assay Project’, can be used to select an assay project and access related datasets. ChEMBL depositor information is also indexed to support sub-setting ChEMBL records. As a result, although http://www.ncbi.nlm.nih.gov/pcassay?term=ChEMBL[sourcename] retrieves all ChEMBL bioassays in PubChem, http://www.ncbi.nlm.nih.gov/pcassay?term=%22ChEMBL%3A%3AScientific+Literature%22%5BSourceName%5D[SourceName] retrieves literature-based records from ChEMBL, and http://www.ncbi.nlm.nih.gov/pcassay?term=%22ChEMBL%3A%3ASt+Jude+Malaria+Screening%22%5BSourceName%5D[SourceName] retrieves ChEMBL records deposited by St. Jude Malaria Screening.
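The depositor-specific search addresses above are ordinary Entrez terms that have been percent-encoded. The following minimal sketch shows how such an address might be assembled with the Python standard library. The helper names `source_name_query` and `search_url` are hypothetical and introduced only for illustration; the base address and the SourceName field are taken from the text.

```python
from urllib.parse import urlencode

# Base address of the BioAssay Entrez search page, as given in the text.
PCASSAY_SEARCH = "http://www.ncbi.nlm.nih.gov/pcassay"


def source_name_query(source_name: str) -> str:
    """Return an Entrez term that restricts a search to one depositor name.

    The name is wrapped in double quotes so that spaces and '::' are
    treated as a single phrase (hypothetical helper, for illustration).
    """
    return f'"{source_name}"[SourceName]'


def search_url(term: str) -> str:
    """Build a search address; urlencode percent-encodes quotes, colons,
    brackets and turns spaces into '+'."""
    return f"{PCASSAY_SEARCH}?{urlencode({'term': term})}"


if __name__ == "__main__":
    for name in (
        "ChEMBL",
        "ChEMBL::Scientific Literature",
        "ChEMBL::St Jude Malaria Screening",
    ):
        print(search_url(source_name_query(name)))
```

Encoding the term explicitly makes clear that `%22`, `%3A`, `%5B` and `%5D` in the addresses stand for the double quote, colon and square brackets of the underlying query.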

PubChem BioAssay FTP AND DOWNLOAD

PubChem provides multiple services for users to download bioassay records, which have been described previously (1). These primarily include (i) an enhanced download function at the Summary service (shown in Figure 3), (ii) a web-based BioAssay download service at http://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi, with a flexible interface supporting full or partial data download by specifying bioassay accessions (AIDs) and tested substance accessions (SIDs) and (iii) the daily updated PubChem BioAssay FTP site at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay, providing open access to all bioassay datasets. While the primary FTP structure remains the same, one new FTP directory, ‘Extras’, has been added to offer additional information about the BioAssay resource. In this folder, the file ‘Cid2BioactivityLink’ provides a list of tested compounds and the corresponding URLs linking to associated bioactivity data. Similarly, the ‘Gi2BioactivityLink’ and ‘Geneid2BioactivityLink’ files provide the corresponding bioactivity data links for protein and gene targets, respectively. The ‘Aid2GiGeneid’ file contains all the bioassay (AID), protein target (GI) and gene target (Gene ID) associations in the BioAssay database. In addition, a file listing bioassays related through assay projects has been added to the directory at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/AssayNeighbors. Column headers for the comma-separated values (CSV) format have been modified to provide consistency among multiple download methods (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/CSV/README). Readout names are now provided in CSV files to ease data parsing and interpretation. In addition, PubChem PUG SOAP (http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html) and PUG REST (http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html) facilities are being developed to support programmatic retrieval of bioassay information.
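As an illustration of programmatic retrieval by AID and SID, the sketch below assembles PUG REST request addresses for bioassay records. It makes several assumptions that are not stated in this article: the request prefix `https://pubchem.ncbi.nlm.nih.gov/rest/pug`, the `assay/aid/<AIDs>/<operation>/<output>` path layout and the `sid` query option follow the public PUG REST documentation as we understand it and should be checked against the help page cited above. The function name `assay_request_url` and the accession numbers used are hypothetical.

```python
from typing import Iterable, Optional

# Assumed PUG REST prefix; not given in this article.
PUG_REST_PREFIX = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_request_url(
    aids: Iterable[int],
    operation: str = "",
    output: str = "CSV",
    sids: Optional[Iterable[int]] = None,
) -> str:
    """Assemble a request address for one or more bioassay accessions.

    aids      -- bioassay accessions (AIDs), joined with commas
    operation -- optional operation such as "description" (assumed name)
    output    -- output format, for example "CSV" or "JSON"
    sids      -- optional tested substance accessions (SIDs) used to
                 request a partial data table (assumed 'sid' option)
    """
    aid_list = ",".join(str(int(a)) for a in aids)
    if not aid_list:
        raise ValueError("at least one AID is required")

    parts = [PUG_REST_PREFIX, "assay", "aid", aid_list]
    if operation:
        parts.append(operation)
    parts.append(output)
    url = "/".join(parts)

    if sids is not None:
        sid_list = ",".join(str(int(s)) for s in sids)
        if sid_list:
            url += "?sid=" + sid_list
    return url


if __name__ == "__main__":
    # Arbitrary accession numbers, chosen only to show the address layout.
    print(assay_request_url([1000]))
    print(assay_request_url([1000, 1001], operation="description", output="JSON"))
    print(assay_request_url([1000], sids=[12345, 67890]))
```

The sketch only builds addresses and performs no network access, so it can be adapted to whichever HTTP client a user prefers.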

PubChem UPLOAD FOR BioAssay SUBMISSION

As a public repository handling diverse and vast amounts of chemical structure and bioassay data, it is critical for PubChem to provide an efficient and user-friendly way to upload data. The recently released PubChem Upload (http://pubchem.ncbi.nlm.nih.gov/upload) makes use of advances in web technologies to offer streamlined support for data submissions and updates to the Substance and BioAssay databases. PubChem Upload supports all functionalities and data exchange formats of its predecessor (1). Furthermore, it provides an extensive set of wizards, inline help tips and tutorials to guide submitters in entering assay data and descriptive information. More specifically, the new assay submission capabilities offered by PubChem Upload include (i) bioassay submission wizards to assist novice users with both small-molecule and RNAi screenings, (ii) improved user interface response to complex input with newer web technology, (iii) simplified new user registration and upgrades for production user accounts, (iv) improved help, including hints built into the user interface and a tutorial, (v) extensive PubChem bioassay templates for new submissions or for record updates, (vi) full editing and integration of assay data and description tables and (vii) expanded import/export handling of spreadsheets for assays. A detailed help document, tutorial and sample submission templates for PubChem Upload are available at http://pubchem.ncbi.nlm.nih.gov/upload/docs/upload_help.html, http://pubchem.ncbi.nlm.nih.gov/upload/tutorial and http://pubchem.ncbi.nlm.nih.gov/upload/docs/upload_help.html#AssaySubmission, respectively. A detailed description of PubChem Upload will be provided in a separate article.

SUMMARY

PubChem is committed to serving as a public repository for bioactivity data of small molecules and RNAi reagents. PubChem also provides an integrated information platform with a suite of tools allowing users to query, analyze and download all database content. PubChem will continue to improve its services and tools as technology advances, and to further integrate the information it contains with third-party annotations and other public biomedical data. With open access to the data and the delivery of the new Upload system, PubChem welcomes the community to use the resource and to contribute data content to the repository.

ACKNOWLEDGEMENTS

The authors thank all submitters who have contributed data to PubChem and the rest of the PubChem team for their support.


FUNDING

The NIH Intramural Research Program. Funding for open access charge: National Institutes of Health, USA.

Conflict of interest statement. None declared.

REFERENCES

1. Wang,Y., Xiao,J., Suzek,T.O., Zhang,J., Wang,J., Zhou,Z., Han,L., Karapetyan,K., Dracheva,S., Shoemaker,B.A. et al. (2012) PubChem’s BioAssay database. Nucleic Acids Res., 40, D400–D412.

2. Wang,Y., Bolton,E., Dracheva,S., Karapetyan,K., Shoemaker,B.A., Suzek,T.O., Wang,J., Xiao,J., Zhang,J. and Bryant,S.H. (2010) An overview of the PubChem BioAssay resource. Nucleic Acids Res., 38, D255–D266.

3. Wang,Y., Xiao,J., Suzek,T.O., Zhang,J., Wang,J. and Bryant,S.H. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res., 37, W623–W633.

4. Bolton,E.E., Wang,Y., Thiessen,P.A. and Bryant,S.H. (2008) PubChem: integrated platform of small molecules and biological activities. Annu. Rep. Comput. Chem., 4, 217–241.

5. Sayers,E.W., Barrett,T., Benson,D.A., Bolton,E., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., DiCuccio,M., Federhen,S. et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 39, D38–D51.

6. Butkiewicz,M., Lowe,E.W. Jr, Mueller,R., Mendenhall,J.L., Teixeira,P.L., Weaver,C.D. and Meiler,J. (2013) Benchmarking ligand-based virtual high-throughput screening with the PubChem database. Molecules, 18, 735–756.

7. Sharman,J.L., Benson,H.E., Pawson,A.J., Lukito,V., Mpamhanga,C.P., Bombail,V., Davenport,A.P., Peters,J.A., Spedding,M. and Harmar,A.J. (2013) IUPHAR-DB: updated database content and new features. Nucleic Acids Res., 41, D1083–D1088.

8. Gaulton,A., Bellis,L.J., Bento,A.P., Chambers,J., Davies,M., Hersey,A., Light,Y., McGlinchey,S., Michalovich,D., Al-Lazikani,B. et al. (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res., 40, D1100–D1107.

9. Mulder,K.W., Wang,X., Escriu,C., Ito,Y., Schwarz,R.F., Gillis,J., Sirokmany,G., Donati,G., Uribe-Lewis,S., Pavlidis,P. et al. (2012) Diverse epigenetic strategies interact to control epidermal differentiation. Nat. Cell Biol., 14, 753–763.

10. Chih,B., Liu,P., Chinn,Y., Chalouni,C., Komuves,L.G., Hass,P.E., Sandoval,W. and Peterson,A.S. (2012) A ciliopathy complex at the transition zone protects the cilia as a privileged membrane domain. Nat. Cell Biol., 14, 61–72.

11. Prager-Khoutorsky,M., Lichtenstein,A., Krishnan,R., Rajendran,K., Mayo,A., Kam,Z., Geiger,B. and Bershadsky,A.D. (2011) Fibroblast polarization is a matrix-rigidity-dependent process controlled by focal adhesion mechanosensing. Nat. Cell Biol., 13, 1457–1465.

12. Imberg-Kazdan,K., Ha,S., Greenfield,A., Poultney,C.S., Bonneau,R., Logan,S.K. and Garabedian,M.J. (2013) A genome-wide RNA interference screen identifies new regulators of androgen receptor function in prostate cancer cells. Genome Res., 23, 581–591.

13. Powell,M.L., Smith,J.A., Sowa,M.E., Harper,J.W., Iftner,T., Stubenrauch,F. and Howley,P.M. (2010) NCoR1 mediates papillomavirus E8^E2C transcriptional repression. J. Virol., 84, 4451–4460.

14. Galluzzi,L., Morselli,E., Vitale,I., Kepp,O., Senovilla,L., Criollo,A., Servant,N., Paccard,C., Hupe,P., Robert,T. et al. (2010) miR-181a and miR-630 regulate cisplatin-induced cancer cell death. Cancer Res., 70, 1793–1803.

15. Smith,J.A., White,E.A., Sowa,M.E., Powell,M.L., Ottinger,M., Harper,J.W. and Howley,P.M. (2010) Genome-wide siRNA screen identifies SMCX, EP400 and Brd4 as E2-dependent regulators of human papillomavirus oncogene expression. Proc. Natl Acad. Sci. USA, 107, 3752–3757.

16. Zhang,S.L., Yeromin,A.V., Zhang,X.H., Yu,Y., Safrina,O., Penna,A., Roos,J., Stauderman,K.A. and Cahalan,M.D. (2006) Genome-wide RNAi screen of Ca(2+) influx identifies genes that regulate Ca(2+) release-activated Ca(2+) channel activity. Proc. Natl Acad. Sci. USA, 103, 9357–9362.

17. Friedman,A. and Perrimon,N. (2006) A functional RNAi screen for regulators of receptor tyrosine kinase and ERK signalling. Nature, 444, 230–234.

18. Gwack,Y., Sharma,S., Nardone,J., Tanasa,B., Iuga,A., Srikanth,S., Okamura,H., Bolton,D., Feske,S., Hogan,P.G. et al. (2006) A genome-wide Drosophila RNAi screen identifies DYRK-family kinases as regulators of NFAT. Nature, 441, 646–650.

19. Bard,F., Casano,L., Mallabiabarrena,A., Wallace,E., Saito,K., Kitayama,H., Guizzunti,G., Hu,Y., Wendler,F., Dasgupta,R. et al. (2006) Functional genomics reveals genes involved in protein secretion and Golgi organization. Nature, 439, 604–607.

20. Vig,M., Peinelt,C., Beck,A., Koomoa,D.L., Rabah,D., Koblan-Huberson,M., Kraft,S., Turner,H., Fleig,A., Penner,R. et al. (2006) CRACM1 is a plasma membrane protein essential for store-operated Ca2+ entry. Science, 312, 1220–1223.

21. DasGupta,R., Kaykas,A., Moon,R.T. and Perrimon,N. (2005) Functional genomic analysis of the Wnt-wingless signaling pathway. Science, 308, 826–833.

22. Nybakken,K., Vokes,S.A., Lin,T.Y., McMahon,A.P. and Perrimon,N. (2005) A genome-wide RNA interference screen in Drosophila melanogaster cells for new components of the Hh signaling pathway. Nat. Genet., 37, 1323–1332.
