
Large Scale Imaging Analytics for In Silico Biomedicine

Joel Saltz, Fusheng Wang, George Teodoro, Lee Cooper, Patrick Widener, Jun Kong, David

Gutman, Tony Pan, Sharath Cholleti, Ashish Sharma, Daniel Brat, Tahsin Kurc

Center for Comprehensive Informatics, the Department of Biomedical Informatics, and the

Department of Pathology & Laboratory Medicine

Emory University

Introduction

The ability to quantitatively characterize biological structure and function in detail through in

silico experiments1 has great potential to reveal new insights into disease mechanisms and enable

the development of novel preventive approaches and targeted treatments. High-resolution

microscopy imaging is playing an increasingly pivotal role in realizing this potential in

healthcare delivery and biomedical research. Digital microscopy technology reduces dependence

on physical slides; it can also enable more effective ways of screening for disease, classifying

disease state, understanding its progression, and evaluating the efficacy of therapeutic strategies.

Systematic studies of tumors at the cellular and sub-cellular levels, for example, provide

tremendous insight into how alterations in intercellular signaling occur and allow investigators

to study the relationship among morphologic characteristics, cellular-level processes, and

genetic, genomic, and protein expression signatures. Studies conducted using tissue slide images

and genomic data in the In Silico Brain Tumor Research Center[1] have produced results that

reveal morphological subtypes of glioblastoma not previously recognized by pathologists and


demonstrate significantly correlated genes through correlation of the extent of necrosis and

angiogenesis with gene expression data[2, 3]. In these studies, in silico experiments analyzing

images from 480 tissue slides from 167 patients discovered that the morphological signatures in

glioblastoma self-aggregate into four distinct clusters. The survival characteristics of this

morphology-driven stratification are significant when compared to the survival of molecular

subtypes, suggesting that morphology is a good predictor of prognosis.

1 The term “in silico experiment” broadly refers to an experiment performed on a computer by analyzing, mining, and integrating biomedical databases and/or through simulations.

Since the first application of digital technology to microscopic data[4], the ability to acquire

high-resolution images from whole tissue slides and tissue microarrays has become more

affordable, faster, and practical[5-13]. The latest generation of devices offer advanced dynamic

focus mechanisms to improve auto-focus quality. Cassette-style slide holders that handle slides

indirectly have reduced mishandling of slides and malfunction during scanning. These advances

have made digital microscopy technology more practical and efficient. Image scanning times

have decreased from 6-8 hours per whole tissue slide about a decade ago to a few minutes with

advanced scanners; and improvements in auto-focusing and slide holders have facilitated high

throughput image generation from batches of slides with minimal manual intervention.

With the improving cost-effectiveness of scanners, it is rapidly becoming possible for a

research study or healthcare operation to routinely generate hundreds to thousands of whole slide

images per day. This progress is fueling the emergence of what we refer to as analytical

digital pathology, which involves the investigation of quantitative correlation and integration of

high resolution, high throughput datasets comprising imaging features from multiple imaging

modalities and from across temporal, functional, and dimensional scales. However, unlike other

imaging modalities (such as MRI, which enjoys widespread adoption), to date, microscopic


imaging has been underutilized in biomedicine. This is primarily because even moderate

numbers of digitized microscopy specimens quickly lead to formidable information synthesis

and management problems. Software to support the extraction and interpretation of information

from thousands of tissue images has to deal with hundreds of terabytes of data, expensive data

processing requirements, and trillions of microscopic objects and their features.

This chapter describes the computational challenges of employing large volumes of

digitized tissue slide data in biomedicine and presents some of the data-intensive computing

approaches we have developed and employed to address these challenges.

Background

A basic application of microscopy imaging technology is telepathology, in which a

pathologist can remotely render diagnoses for patient care in the absence of glass slides and a

microscope[14, 15]. In this form of use, a whole slide imaging system should support the

implementation of a “virtual microscope”[14, 16-29], which emulates the basic operation of a

microscope, enabling browsing through a slide to locate an area of interest, local browsing in a

region of interest to observe the region surrounding the current view, and changing magnification

level and focal plane. A virtual microscope can also implement functionality that cannot be

achieved with a physical microscope, such as manual annotation of areas of interest and viewing

of a slide by multiple users simultaneously.
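
To make these browsing operations concrete, the sketch below reads one field of view from a pyramidal whole slide image using the open-source OpenSlide Python bindings. This is an illustration only, not part of the systems described in this chapter; the file name, coordinates, and pyramid level are hypothetical.

```python
# Minimal sketch of "virtual microscope" region access with OpenSlide.
# Assumes a pyramidal whole-slide file; the path and coordinates are hypothetical.
import openslide

slide = openslide.OpenSlide("example_slide.svs")   # hypothetical file
print(slide.dimensions)          # full-resolution (width, height) in pixels
print(slide.level_count)         # number of magnification levels in the pyramid
print(slide.level_downsamples)   # downsample factor of each level

# Emulate panning/zooming: read a 1024x1024 view at pyramid level 2.
# The (x, y) location is always given in level-0 (full resolution) coordinates.
region = slide.read_region(location=(50_000, 60_000), level=2, size=(1024, 1024))
region.convert("RGB").save("current_view.png")     # e.g., hand off to a viewer
slide.close()
```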

While a virtual microscope facilitates remote viewing and sharing of images, the

digitization of microscopy slides has been accompanied by a proliferation of image analysis

methods as well. The objectives of extracting detailed cellular and sub-cellular information from

whole slide images will depend on the particular study being conducted – a study may focus on

reducing variability and error in diagnosis by emulating established diagnostic procedures;


another study may target novel insights into the biology of disease progression by investigating

morphological characteristics of the disease. Nevertheless, analytical digital pathology employs a

set of common data processing operations[30]: (1) Stitching and registration. Some instruments

capture a whole slide image as a set of image tiles. These tiles need to be aligned and stitched

together to form a full image. Image registration is the process of mapping two or more images

into the same coordinate frame. (2) Segmentation of objects and regions. Often the entities to be

segmented are composed of collections of simple and complex objects and structures and are

defined by a complex shape and textural appearance. Examples include identifying cell nuclei,

cell membranes, the boundaries of blood vessels, and the extent of regions at the level of tissues.

(3) Feature extraction. It is the process of calculating informative descriptions of objects or

regions, and often precedes classification or segmentation tasks. It can be applied on the whole

image or individual segmented objects to describe characteristics such as shape and texture. (4)

Classification. Segmented objects, regions, or whole slides can be classified into meaningful

groups based on extracted features. Classification of cell types, antibody activation, or entire

slides into pathologic categories are common classification themes. A high level depiction of a

nuclear segmentation and classification workflow is shown in Figure 1.
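
As an illustration of how the segmentation, feature extraction, and classification steps fit together, the sketch below strings generic scikit-image and scikit-learn routines into a toy nuclear analysis pipeline on one tile. The thresholding scheme, feature set, and training data are stand-ins for illustration; they are not the pipelines evaluated in our studies.

```python
# Toy segmentation -> feature extraction -> classification pipeline on one tile.
# Illustrative only; thresholds, features, and labels are assumed.
import numpy as np
from skimage import filters, measure, morphology
from sklearn.ensemble import RandomForestClassifier

def segment_nuclei(gray_tile):
    """Segment dark nuclei in a grayscale tile with intensities in [0, 1]."""
    mask = gray_tile < filters.threshold_otsu(gray_tile)   # nuclei assumed darker
    mask = morphology.remove_small_objects(mask, min_size=30)
    return measure.label(mask)                              # one label per nucleus

def extract_features(labels, gray_tile):
    """Compute simple shape and intensity features for each segmented nucleus."""
    feats = []
    for r in measure.regionprops(labels, intensity_image=gray_tile):
        feats.append([r.area, r.eccentricity, r.solidity, r.mean_intensity])
    return np.array(feats)

# Hypothetical training data standing in for pathologist-curated examples.
rng = np.random.default_rng(0)
train_X, train_y = rng.random((200, 4)), rng.integers(0, 2, 200)
clf = RandomForestClassifier(n_estimators=50).fit(train_X, train_y)

tile = rng.random((512, 512))            # stand-in for a real image tile
labels = segment_nuclei(tile)
features = extract_features(labels, tile)
if len(features):
    classes = clf.predict(features)      # e.g., tumor vs. non-tumor nuclei
```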

We should note that this list of operations is a high-level, simplified description of steps

in microscopy image analysis. Individual steps are often composed of a series of sub-steps. There

have been algorithmic advances in improving accuracy of image analysis methods and reducing

their execution times. Nonetheless, computation, storage, and networking still remain significant

challenges in analytical digital pathology. We describe these challenges in the next section.


Figure 1. Nuclear segmentation and classification pipeline. Images are processed through a set of

operations for detecting boundaries of nuclei, computing a set of features for each nucleus, and

classifying nuclei into categories using machine learning algorithms on these features. The results are

stored in a database for further analysis and algorithm evaluation.

Data Intensive Computation Challenges in Analytical Digital Pathology

A typical whole slide pathology image contains 20 billion pixels (with digitization at 40X

objective magnification). An uncompressed representation with 8-bit color channels occupies

about 56 GB. A multilayer image stack that provides a focus capability typically contains tens

of such images. In a typical analysis scenario, computational requirements for a single analysis

type (e.g., feature extraction and region classification) are approximately 10 hours for a single

image stack at 5X magnification. Furthermore, a single scanner can generate hundreds of images,

and a study may generate or reference thousands of slides. For example, through the NCI-funded

In Silico Brain Tumor Research Center (ISBTRC) project[1], our team has so far collected 678

slides at 40X and 480 slides at 20X magnification. We will continue to collect approximately

3500 slides from about 700 patients over the next couple of years.
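
The storage figures quoted above follow from simple arithmetic; the short calculation below reproduces their order of magnitude, assuming three 8-bit color channels per pixel and, for the multilayer estimate, a hypothetical stack of 20 focal planes (the 56 GB figure corresponds to binary GiB under these assumptions).

```python
# Back-of-envelope storage arithmetic for uncompressed whole-slide images.
pixels = 20e9                      # ~20 billion pixels at 40X, as quoted above
bytes_per_pixel = 3                # 8 bits per channel, 3 color channels (assumed)
image_bytes = pixels * bytes_per_pixel

print(f"single image: {image_bytes / 2**30:.0f} GiB")              # ~56 GiB
print(f"20-layer focus stack: {20 * image_bytes / 2**40:.1f} TiB")  # ~1.1 TiB
print(f"1000-slide study: {1000 * image_bytes / 2**40:.0f} TiB")    # ~55 TiB
```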

The data intensive computation requirements are exacerbated in large-scale studies

involving thousands of images. Each image analysis pipeline has its strengths and limitations.

The effectiveness of an analysis pipeline depends on many factors including the nature of the

histological structures being segmented, the classifications being carried out, and on sample


preparation and staining. It is not feasible to manually inspect each image for every feature and

fine-tune the analysis pipelines in a large-scale study. Thus, detailed multi-scale characterization

of morphology requires (1) coordinated use of many interrelated analysis pipelines, (2)

performing algorithm sensitivity analyses, and (3) comparison of analysis results from multiple

analysis pipelines and analysis runs. For instance, several hundred variations of analysis

pipelines can be evaluated on a few hundred images. Systematic comparison and evaluation of

the results from these runs can help weed out bad pipeline choices and identify a smaller number

of priority pipelines. These pipelines are then executed on the larger collection of images.

This approach leads to a difficult data management, querying, and integration problem.

Image analysis algorithms segment and classify 10^5 to 10^7 cells in each virtual slide of size

10^5 by 10^5 pixels. Brain tumor tissue analyses, for instance, can encompass the identification and

quantification of mitotic figures and subcellular structures, which is done by processing cells or

regions identified as brain tumor, as well as of angiogenesis and pseudopalisades,

which requires a synthesis of regional texture analyses and segmentation of larger scale

histological structures. Because these analyses execute multiple interrelated analysis pipelines as

described above, a systematic analysis of large-scale image data involves

classification of roughly 10^9 to 10^12 micro-anatomic structures. The process of classifying a

given cell is done using roughly 10-100 shape, texture, and (when appropriate) stain

quantification features. Thus, a thorough data analysis limited to classifying cells could

encompass 10^10 to 10^13 features. Without parallelization, it could take an hour or longer to

compare, in a database, the results generated by two algorithms for a single image. Comparing two

result sets from a hundred images could take a week. In addition to comparing results from

multiple analyses, scalable mechanisms are needed for producing biologically or


computationally meaningful data aggregates (e.g., machine learning based clustering) from

spatial objects and features. Computation of data aggregates on a large number of images could

take days or weeks.
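
A back-of-envelope calculation reproduces the object and feature counts quoted above; the study size and number of pipeline variations assumed below are illustrative, chosen to be consistent with the thousands of slides and hundreds of pipeline variations described in this section.

```python
# Back-of-envelope scale of the result data described in this section.
cells_per_slide = (1e5, 1e7)       # segmented nuclei per whole-slide image
slides = 1_000                     # a study with thousands of images (assumed)
pipeline_runs = (10, 100)          # interrelated pipelines / parameter variations (assumed)
features_per_cell = 10             # conservative per-object feature count

objects_low = cells_per_slide[0] * slides * pipeline_runs[0]    # 1e9
objects_high = cells_per_slide[1] * slides * pipeline_runs[1]   # 1e12
print(f"segmented objects: {objects_low:.0e} to {objects_high:.0e}")
print(f"feature values:    {objects_low * features_per_cell:.0e} "
      f"to {objects_high * features_per_cell:.0e}")             # 1e10 to 1e13
```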

Data Intensive and High Performance Computing Approaches for Large Scale Analytical

Digital Pathology

Analysis of Large Microscopy Imaging Datasets

Processing of very large images and image datasets requires careful coordination of data retrieval,

distribution of data among processing nodes, and mapping of processing tasks to nodes. A

combination of multiple parallelism approaches can be employed to quickly render results from a

large dataset. Multiple images can be processed concurrently in a bag-of-tasks strategy, in which

images or image tiles are assigned to groups of computing nodes, generally using a demand-

driven strategy. Several systems have employed this type of parallelism on cluster systems and in

Grid computing environments. Gurcan et al. reported the successful application of distributed

computing in a pilot project to support automated characterization of Neuroblastoma using the

Shimada classification system[31]. The ImageMiner system employed IBM’s World Community

Grid in July 2006 to analyze more than 100,000 imaged tissue discs[27, 32]. High-resolution images

may not fit in the main memory of a single processor. In addition, image analysis workflows may

consist of operations that can process data in a pipelined, streaming manner. These

characteristics of data and operations are suitable for combined use of task- and data-parallelism.

The out-of-core virtual microscope (OCVM) system[33, 34], based on the DataCutter

infrastructure[35], supports multiple parallelism approaches. In this system, multiple instances of

workflows can be created and executed with each instance processing a subset of images. Within


each workflow instance, an image is partitioned into chunks (rectangular sub-regions) so that I/O

operations can be coordinated by the runtime system rather than relying on virtual memory.

The processing operations constituting the workflow can be mapped to processors to reduce I/O

and communication overheads. Multiple instances of an operation can be instantiated to allow for

data-parallelism and pipelined processing. The execution time of image analyses can further be

reduced when it is acceptable to trade off analysis accuracy for performance – in some cases,

there may not be enough resources available to carry out an analysis at the highest resolution or

an exploratory study may not need the highest accuracy to process a large set of potential data

points quickly. A framework to support accuracy-performance trade-offs in microscopy image

analysis has been developed by Kumar et al.[33] This framework integrates the Wings system

for high-level semantic expression of image analysis workflows, Pegasus for execution of

workflows in a Grid environment, OCVM for fine-grain parallelism and pipelined execution

within a high-performance Grid node, and a set of algorithms for adaptive processing. The

algorithms exploit spatial locality of image features to create dynamic data processing schedules

in order to improve performance while meeting quality of output requirements.
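
The chunking and demand-driven assignment ideas can be illustrated in a few lines of Python. The sketch below partitions a synthetic image into rectangular chunks and lets idle worker processes pull the next chunk, which is the scheduling behavior described above. It illustrates the concept only and is not the OCVM or DataCutter implementation; the chunk size, image dimensions, and load_chunk reader are hypothetical.

```python
# Minimal sketch of chunk-based, demand-driven image processing: an image is
# partitioned into rectangular chunks and workers pull chunks as they finish.
from multiprocessing import Pool
import numpy as np

CHUNK = 2048  # chunk edge length in pixels (assumed)

def chunk_coords(height, width, chunk=CHUNK):
    """Yield (row, col, h, w) rectangles covering an image of the given size."""
    for r in range(0, height, chunk):
        for c in range(0, width, chunk):
            yield r, c, min(chunk, height - r), min(chunk, width - c)

def load_chunk(r, c, h, w):
    """Placeholder for real chunk I/O; returns synthetic pixel data."""
    rng = np.random.default_rng(r * 31 + c)
    return rng.random((h, w))

def process_chunk(coords):
    """Stand-in analysis: count 'foreground' pixels in one chunk."""
    r, c, h, w = coords
    tile = load_chunk(r, c, h, w)
    return int((tile < 0.5).sum())

if __name__ == "__main__":
    coords = list(chunk_coords(20_000, 20_000))
    # chunksize=1 gives demand-driven assignment: idle workers grab the next chunk.
    with Pool(processes=8) as pool:
        partial = pool.map(process_chunk, coords, chunksize=1)
    print("foreground pixels:", sum(partial))
```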

General purpose graphics processing units (GPGPUs) have emerged as a popular

implementation platform for many data-intensive computational science applications. An

increasing number of research projects have developed GPU-accelerated implementations of

image processing operations[36-60]. The GPU implementations of image processing operations

aim to exploit low-latency and high-bandwidth GPU memories and massively multi-threaded

execution models.
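
As a simple illustration of the kind of per-pixel operation that benefits from GPU offloading, the sketch below moves a tile to the device, applies a threshold, and reduces the result. It uses the CuPy library as an assumed stand-in for the GPU frameworks cited above, requires a CUDA-capable device, and is not taken from any of the cited implementations.

```python
# Minimal sketch of offloading a per-pixel operation (thresholding) to a GPU
# with CuPy; requires a CUDA-capable device and the cupy package (assumed).
import numpy as np
import cupy as cp

tile = np.random.default_rng(0).random((4096, 4096), dtype=np.float32)

gpu_tile = cp.asarray(tile)            # host -> device transfer
gpu_mask = gpu_tile < 0.5              # evaluated by many GPU threads in parallel
foreground = int(gpu_mask.sum())       # reduction performed on the device
mask = cp.asnumpy(gpu_mask)            # device -> host, if needed on the CPU side

print("foreground pixels:", foreground)
```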

Representing and Managing Image Data and Analysis Results

Efficient data repositories anchored on rich and flexible data models play a crucial role in


interpretation, reusability, and reproducibility of imaging studies. The underlying data model

should be able to precisely and unambiguously describe image datasets, analyses, and analysis

results. It should be able to represent a) context relating to patient data, specimen preparation,

special stains, etc, b) human observations involving pathology classification and characteristics,

and c) algorithm and human-described segmentations (markups), features, and annotations.

Markups can be either geometric shapes or image masks; annotations can be calculations,

observations, disease inferences or external annotations. The relationships between data elements

can also be complex. For example, additional annotations can be derived from existing

annotations. As a result, generic and extensible metadata models are required to support different

types of analyses and applications.

Several projects have developed data models for representation and management of

microscopy images and analysis results[61-65], although there are yet no official standard

models. The Open Microscopy Environment (OME)[61] project has developed a data model and

a database system that can be used to represent, exchange, and manage image data and metadata.

The OME data model provides a common specification for storing details of microscope setup

and image acquisition. The Cell-Centered Database (CCDB)[64, 65] provides a data model to capture

image analysis output, image data, and information on the specimen preparation and imaging

conditions that generated the image data. DICOM Working Group 26 is developing a DICOM

based standard for storing microscopy images[66]. The metadata in this model captures

information such as patient, study and equipment information. The PAIS model[62, 63] is

designed as an object-oriented, extensible, semantically enabled data model to support large-scale

analytical imaging and human observations; it is oriented toward storage and performance

efficiency and supports alternative implementations. An XML-based representation of the PAIS


model can be used to facilitate exchange and sharing of results in a format more compatible with

Web standards and tools. However, for very large result sets, the XML representation is not

efficient, even with compression of the documents. An alternative approach is to employ self-

describing structured container technologies such as HDF5. Such container technologies provide

more efficient storage than text-based file formats like XML, while still making available the

structure of the data for query purposes. We have observed in our projects that the HDF5

representation of analysis results is on average 6-7 times smaller, in compressed form, than

compressed XML representation of the same results.
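
The container approach can be illustrated with the h5py bindings to HDF5: per-nucleus features and boundaries for one hypothetical image and algorithm run are written as compressed datasets under self-describing groups. The group layout, attribute names, and fixed-length boundary encoding are assumptions for illustration and do not reproduce the PAIS schema.

```python
# Minimal sketch of storing analysis results in a self-describing HDF5 container.
# The group/dataset layout is an assumption for illustration, not the PAIS schema.
import numpy as np
import h5py

num_nuclei = 10_000
rng = np.random.default_rng(0)
features = rng.random((num_nuclei, 24), dtype=np.float32)   # shape/texture features
# Boundaries stored as fixed-length (x, y) polygon vertices for simplicity.
boundaries = rng.integers(0, 100_000, size=(num_nuclei, 32, 2), dtype=np.int32)

with h5py.File("analysis_results.h5", "w") as f:
    run = f.create_group("image_0001/algorithm_A")            # hypothetical identifiers
    run.attrs["magnification"] = "40X"
    run.attrs["pipeline_version"] = "1.0"
    run.create_dataset("features", data=features, compression="gzip")
    run.create_dataset("boundaries", data=boundaries, compression="gzip")

# Later, a query tool can read only the datasets it needs:
with h5py.File("analysis_results.h5", "r") as f:
    areas = f["image_0001/algorithm_A/features"][:, 0]        # e.g., first feature column
```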

Query Support

An image analysis results repository should allow retrieval of information from a large

number of data elements by a wide range of queries. Many of the data elements are anatomic

objects such as lesions, cells, nuclei, blood vessels, etc. In addition to an object’s classification,

spatial relationships among objects are also important in understanding the biological structure

and function. Examples of query types include: (i) retrieval of image data and metadata (e.g.,

count nuclei whose grade is less than 3); (ii) queries to compare results generated from

different approaches, and validate machine generated results against human observations (e.g.,

find nuclei that are classified by observer A and by algorithm B and whose feature f is within the

range of a and b); (iii) queries on assessing relative prevalence of features or classified objects, or

assessing spatial coincidence of combinations of features or objects (e.g., which nucleus types

preserve nuclear features such as distance and shape between two images); (iv) queries to support

selection of collections of segmented regions, features, objects for further machine learning or

content based retrieval applications (e.g., find nuclei with an area between 50 pixels and 200

pixels in a selected region of interest); and (v) semantic queries based on spatial relationships and


annotations and properties drawn from domain ontologies (e.g., search for objects with an

observation concept astrocytoma and that are within 100 pixels of each other, expanding the search

to include the subclass concepts of astrocytoma, gliosarcoma and giant cell glioblastoma).
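
As a concrete, if simplified, illustration of query type (iv), the sketch below runs the area and region-of-interest selection as a declarative query over an assumed flat table of segmented nuclei using SQLite; the table schema, column names, and toy rows are hypothetical, and a production repository would use the parallel and spatial database configurations discussed next.

```python
# Minimal sketch of query type (iv) against an assumed flat table of nuclei.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE nucleus (
                    image_id TEXT, algorithm TEXT,
                    center_x REAL, center_y REAL,
                    area REAL, eccentricity REAL)""")
conn.executemany(
    "INSERT INTO nucleus VALUES (?, ?, ?, ?, ?, ?)",
    [("img1", "algA", 120.0, 340.0, 150.0, 0.4),      # toy rows
     ("img1", "algA", 900.0, 910.0, 600.0, 0.7)])

# "Find nuclei with an area between 50 and 200 pixels in a region of interest."
roi = dict(xmin=0, ymin=0, xmax=500, ymax=500)        # hypothetical ROI
rows = conn.execute(
    """SELECT image_id, center_x, center_y, area
       FROM nucleus
       WHERE area BETWEEN 50 AND 200
         AND center_x BETWEEN :xmin AND :xmax
         AND center_y BETWEEN :ymin AND :ymax""", roi).fetchall()
print(rows)
```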

In order to scale to large volumes of data, databases of analysis results can be physically

partitioned across the nodes of a cluster system. Distributed memory can also be leveraged to

reduce I/O costs. We have investigated the performance of different database configurations for

spatial joins and cross-match operations[67]. The configurations included a parallel database

management system with active disk style execution support for some types of database

operations, a database system designed for high-availability and high-throughput (MySQL

Cluster), and a distributed collection of database management systems with data replication. Our

experimental evaluation of cross-match algorithms[68, 69] shows that the choice of a database

configuration can significantly impact the performance of the system. The configuration with

distributed database management systems with replication (i.e., replication of portions of the

database) provides a flexible environment, which can be adjusted to the data access patterns and

dataset characteristics.

Applying the computing infrastructure and software stack, such as MapReduce[70], from

the domain of enterprise data analysis to pathology image analysis has the potential to pave the

way for efficient, cost-effective solutions as well. Recent work[71] has demonstrated a

Hadoop-based[72, 73] implementation of spatial query processing in analytical digital pathology

as illustrated in Figure 2. The implementation provides a declarative query language and an

efficient real-time spatial query engine with dynamically built spatial indexes to support query

processing on clusters with multi-core CPUs. Processing of a query is accomplished in several

steps: i) Analysis results with spatial boundaries are retrieved by the query engine, and R*-tree


indices are built on the fly; ii) Initial spatial filtering (spatial join) is done through minimum

bounding rectangle (MBR) tests based on the entries in the R*-tree indices; and iii) computational

geometry algorithms for query refinement and spatial measurement are performed to generate the

final results. The last step dominates the cost of query execution, and the first two steps

effectively filter non-intersecting spatial objects to minimize geometry computations. To parallelize

spatial data processing, data is partitioned based on image tiles, which form natural units for

MapReduce based execution. The data is staged on HDFS[72] (Figure 2a), and the map function

forms tasks keyed by tiles, and the reduce function calls the query engine to execute

spatial joins (Figure 2b). Support for feature queries is based on Hive[73], which provides a

SQL-like query language and supports major aggregation queries running on MapReduce. To

provide an integrated query language, Hive can be extended to support both feature queries and

spatial queries. The MapReduce based approach not only provides high performance spatial

queries on cost-effective clusters, but also makes it convenient for users to write queries.

Figure 2. MapReduce based query processing for result comparison. (a) HDFS data staging: boundary files produced by different analysis algorithms are merged, tile by tile, into (imageID, tileID, boundary) records and copied to HDFS. (b) MapReduce based queries: map tasks scan the staged data by tile, and one reducer per tile executes the spatial join.
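
The filter-and-refine structure of the spatial join can be illustrated in a few lines. In the sketch below, the filter step is a plain bounding-box (MBR) test and the refinement uses exact polygon geometry via the Shapely library, whereas the engine described above builds R*-tree indices per tile and runs one reducer per tile. The toy polygons and the Jaccard overlap measure are illustrative assumptions, not the actual comparison metric.

```python
# Minimal sketch of the two-phase spatial join described above: an MBR-based
# filter step followed by exact geometry refinement on the surviving pairs.
from shapely.geometry import Polygon

def mbr_overlaps(a, b):
    """Filter step: do the minimum bounding rectangles of a and b intersect?"""
    ax0, ay0, ax1, ay1 = a.bounds
    bx0, by0, bx1, by1 = b.bounds
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def spatial_join(result_a, result_b):
    """Pair up nuclei from two algorithm runs whose boundaries overlap."""
    pairs = []
    for i, pa in enumerate(result_a):
        for j, pb in enumerate(result_b):
            if not mbr_overlaps(pa, pb):
                continue                       # cheap filter removes most pairs
            if pa.intersects(pb):              # refinement: exact geometry test
                overlap = pa.intersection(pb).area / pa.union(pb).area
                pairs.append((i, j, overlap))  # e.g., Jaccard overlap measure
    return pairs

# Toy boundaries from two hypothetical segmentation runs on the same tile.
run_a = [Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])]
run_b = [Polygon([(5, 5), (15, 5), (15, 15), (5, 15)])]
print(spatial_join(run_a, run_b))              # [(0, 0, 0.1428...)]
```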


In addition to spatial query support, semantic query support is also needed, because annotations

on objects may draw from domain ontologies (e.g., cell ontology to describe different cell types,

genome ontology to represent genomic characteristics), creating a semantically rich environment.

An important aspect of semantic information systems is the fact that additional assertions (i.e.,

annotations and classifications) can be inferred from initial assertions (also called explicit

assertions) based on the ontology and the semantics of the ontology language. This facilitates a

more comprehensive mechanism for exploration of experiment results in the context of domain

knowledge. Query execution and on-the-fly computation of assertions may take too long on a

single processor machine. Pre-computation of inferred assertions, also referred to as the

materialization process, can reduce the execution time of subsequent queries. Combined use of

semantic stores[74-76] and rule engines[77] can offer a repository system capable of evaluating

spatial predicates and rules. In such a system, the rule engine and the semantic store/inference

engine interact to compute inferred assertions based on the ontology in the system, the set of

rules, and the initial set of explicit assertions (annotations). Rules that utilize spatial relationships

might generate new instances of ontological concepts based on the evaluation of the rules.

Execution strategies leveraging high-performance parallel and distributed machines can reduce

execution times and speed up the materialization process for very large datasets[78, 79]. One

possible strategy is to employ data parallelism by partitioning the space in which the spatial

objects are embedded. Another parallelization strategy is to partition the ontology axioms and

rules, distributing the computation of axioms and rules to processors. This partitioning would

enable processors to evaluate different axioms and rules in parallel. Inter-processor

communication might be necessary to ensure correctness. This parallelization strategy attempts


to leverage axiom-level parallelism and will benefit applications where the ontology contains

many axioms with few dependencies. A third possible strategy is to combine the first two

strategies with task-parallelism. In this strategy, N copies of the semantic store engine and M

copies of the rule engine are instantiated on the parallel machine. The system coordinates the

exchange of information and the partitioning of workload between the semantic store engine

instances and the rule engine instances. The numbers N and M will depend on the cost of the

inference execution as well as the partitioning of the workload based on spatial domain and/or

ontology axioms.
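
The two ingredients discussed in this section can be illustrated in a deliberately minimal form: expanding a queried concept to its subclasses (the astrocytoma example above) and evaluating a spatial rule over explicit assertions to derive new facts. The toy ontology fragment, object annotations, and coordinates below are assumptions; a real deployment would use a semantic store and rule engine as described.

```python
# Minimal sketch of subclass expansion plus a spatial rule over explicit assertions.
from math import dist

# Toy ontology fragment: subclass relationships (assumed for illustration).
subclass_of = {
    "gliosarcoma": "astrocytoma",
    "giant cell glioblastoma": "astrocytoma",
    "astrocytoma": "glioma",
}

def expand(concept):
    """Return the concept together with all of its (transitive) subclasses."""
    result, changed = {concept}, True
    while changed:
        changed = False
        for sub, sup in subclass_of.items():
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result

# Explicit assertions: object id -> (observation concept, centroid in pixels).
objects = {
    "obj1": ("astrocytoma", (100, 100)),
    "obj2": ("gliosarcoma", (150, 130)),
    "obj3": ("giant cell glioblastoma", (5000, 5000)),
}

# Query: objects annotated with astrocytoma (or any subclass) within 100 pixels.
wanted = expand("astrocytoma")
hits = [(a, b) for a in objects for b in objects if a < b
        and objects[a][0] in wanted and objects[b][0] in wanted
        and dist(objects[a][1], objects[b][1]) <= 100]
print(hits)   # a rule engine could now assert, e.g., adjacency between obj1 and obj2
```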

Discussion and Conclusions

High-resolution, high-throughput instruments are being employed routinely not only in medical

science, but also in health care delivery settings at an accelerating rate. As this decade

progresses, significant advances in medical information technologies will transform very large

volumes of multi-scale, multi-dimensional data into actionable information to drive the

discovery, development, and delivery of new mechanisms of preventing, diagnosing, and healing

complex disease. Data produced by advances in digitization and image analysis are outpacing the

storage and computation capacities of workstations and small cluster systems. The “big data”

from image analysis has high performance and scalability requirements similar to those of

enterprise healthcare data, but presents unique challenges. In the future, even medium-scale hospitals and

research projects will require capabilities to manage thousands of high-resolution images,

execute and manage interrelated analysis pipelines, and query trillions of microscopic objects

and their features. These applications demand fast loading and query response, as well as

declarative query interfaces for high usability.

Computational systems with multiple levels of computing and memory hierarchies, such


as high performance computing systems consisting of multi-core CPUs and multiple GPUs and

storage composed of multiple levels of spinning drives and SSDs in RAID configurations, are

becoming mainstream configurations, replacing more traditional homogeneous computing

clusters. These systems offer tremendous computing power and low-latency and high-throughput

I/O capabilities. Many challenges, however, remain for the effective use of these new

technologies. Novel storage, indexing, data staging, and scheduling techniques and middleware

support are needed to manage storage hierarchies in tandem with executing computations on

heterogeneous systems of CPU-GPU nodes. There have been substantial advances in network

switches and networking protocols for intra-cluster communications. Technologies such as

Infiniband provide low-latency, high-bandwidth communication substrates. However, progress

in wide-area networking has been relatively slow. While multi-Gigabit networks are becoming

more widely deployed within institutions, access to remote resources is still hindered by slow,

high-latency networks. Efficient compression, progressive data transmission, and intelligent data

caching and computation reuse methods will continue to play critical roles in enabling digital

pathology and scientific collaborations involving large pathology image datasets.

Acknowledgments. This work was supported in part by SAIC/NCI Contract No. HHSN261200800001E and N01-CO-12400 from the National Cancer Institute, R24HL085343 from the National Heart Lung and Blood Institute, by Grants 1R01LM011119-01 and R01LM009239 from the National Library of Medicine, RC4MD005964 from National Institutes of Health, PHS Grant UL1RR025008 from the Clinical and Translational Science Awards program, and P20 EB000591 by the Biomedical Information Science and Technology Initiative program.

References

1. Saltz, J., et al., Multi-Scale, Integrative Study of Brain Tumor: In Silico Brain Tumor

Research Center. Proceedings of the Annual Symposium of the American Medical Informatics Association 2010 Summit on Translational Bioinformatics (AMIA-TBI 2010), San Francisco, CA, 2010.

2. Cooper, L.A., et al., An integrative approach for in silico glioma research. IEEE Trans Biomed Eng, 2010. 57(10): p. 2617-21.

3. Cooper, L.A.D., et al., Morphological Signatures and Genomic Correlates in Glioblastoma, in International Symposium on Biomedical Engineering, 2011, IEEE: Chicago.

4. Mayall, B.H. and M.L. Mendelsohn, Deoxyribonucleic acid cytophotometry of stained human leukocytes. II. The mechanical scanner of CYDAC, the theory of scanning photometry and the magnitude of residual errors. J Histochem Cytochem, 1970. 18(6): p. 383-407.

5. Eide, T.J., I. Nordrum, and H. Stalsberg, The validity of frozen section diagnosis based on video-microscopy. Zentralbl Pathol, 1992. 138(6): p. 405-7.

6. Eide, T.J. and I. Nordrum, Frozen section service via the telenetwork in northern Norway. Zentralbl Pathol, 1992. 138(6): p. 409-12.

7. Kaplan, K.J., et al., Use of robotic telepathology for frozen-section diagnosis: a retrospective trial of a telepathology system for intraoperative consultation. Mod Pathol, 2002. 15(11): p. 1197-204.

8. Nordrum, I., et al., Remote frozen section service: a telepathology project in northern Norway. Hum Pathol, 1991. 22(6): p. 514-8.

9. Fey, E.G. and S. Penman, The morphological oncogenic signature. Reorganization of epithelial cytoarchitecture and metabolic regulation by tumor promoters and by transformation. Dev Biol (N Y 1985), 1986. 3: p. 81-100.

10. Weinstein, R.S., K.J. Bloom, and L.S. Rozek, Telepathology and the networking of pathology diagnostic services. Arch Pathol Lab Med, 1987. 111(7): p. 646-52.

11. Weinstein, R.S., et al., Overview of telepathology, virtual microscopy, and whole slide imaging: prospects for the future. Hum Pathol, 2009. 40(8): p. 1057-69.

12. Williams, S., et al., Telepathology for patient care: what am I getting myself into? Adv Anat Pathol, 2010. 17(2): p. 130-49.

13. Rojo, M.G., et al., Critical comparison of 31 commercially available digital slide systems in pathology. Int J Surg Pathol, 2006. 14(4): p. 285-305.

14. Wilbur, D.C., et al., Whole-slide imaging digital pathology as a platform for teleconsultation: a pilot study using paired subspecialist correlations. Arch Pathol Lab Med, 2009. 133(12): p. 1949-53.

15. Gilbertson, J.R., et al., Primary histologic diagnosis using automated whole slide imaging: a validation study. BMC Clin Pathol, 2006. 6: p. 4.

16. Afework, A., et al., Digital dynamic telepathology--the Virtual Microscope. Proc AMIA Symp, 1998: p. 912-6.

17. Catalyurek, U., et al., The Virtual Microscope. IEEE Transactions on Information Technology in Biomedicine, 2003. 7(4): p. 230--248.

18. Ferreira, R., et al., The Virtual Microscope. Proc AMIA Annu Fall Symp, 1997: p. 449-53.

19. Balis, U.J., Telemedicine and telepathology. Clin Lab Med, 1997. 17(2): p. 245-61.

20. Dziegielewski, M., G.M. Velan, and R.K. Kumar, Teaching pathology using 'hotspotted' digital images. Med Educ, 2003. 37(11): p. 1047-8.

21. Farah, C.S. and T. Maybury, Implementing digital technology to enhance student learning of pathology. Eur J Dent Educ, 2009. 13(3): p. 172-8.

22. Furness, P.N., The use of digital images in pathology. J Pathol, 1997. 183(3): p. 253-63.

23. Guzman, M. and A.R. Judkins, Digital pathology: a tool for 21st century neuropathology. Brain Pathol, 2009. 19(2): p. 305-16.

24. Leong, F.J. and A.S. Leong, Digital imaging in pathology: theoretical and practical considerations, and applications. Pathology, 2004. 36(3): p. 234-41.

25. Marchevsky, A.M., et al., Storage and distribution of pathology digital images using integrated web-based viewing systems. Arch Pathol Lab Med, 2002. 126(5): p. 533-9.

26. Saltz, J.H., Digital pathology--the big picture. Hum Pathol, 2000. 31(7): p. 779-80.

27. Yang, L., et al., Virtual microscopy and grid-enabled decision support for large-scale

analysis of imaged pathology specimens. IEEE Trans Inf Technol Biomed, 2009. 13(4): p. 636-44.

28. Zheng, L., et al., Design and analysis of a content-based pathology image retrieval system. IEEE Trans Inf Technol Biomed, 2003. 7(4): p. 249-55.

29. Hadida-Hassan, M., et al., Web-based telemicroscopy. J Struct Biol, 1999. 125(2-3): p. 235-45.

30. Gurcan, M.N., et al., Histopathological Image Analysis: A Review. IEEE Rev Biomed Eng, 2009. 2: p. 147-171.

31. Gurcan, M.N., et al., Computerized pathological image analysis for neuroblastoma prognosis. AMIA Annu Symp Proc, 2007: p. 304-8.

32. Yang, L., et al., High throughput analysis of breast cancer specimens on the grid. Med Image Comput Comput Assist Interv Int Conf Med Image Comput Comput Assist Interv, 2007. 10(Pt 1): p. 617-25.

33. Kumar, V., et al., An Integrated Framework for Parameter-based Optimization of Scientific Workflows, in The 18th International Symposium on High Performance and Distributed Computing (HPDC 2009), 2009: Germany.

34. Kumar, V., et al., Large-scale biomedical image analysis in grid environments. IEEE Transactions on Information Technology in Biomedicine, 2008. 12(2): p. 154-161.

35. Beynon, M., et al., Distributed Processing of Very Large Datasets with DataCutter. Parallel Computing, 2001. 27(11): p. 1457-2478.

36. Meilander, D., et al., Parallel medical image reconstruction: from graphics processing units (GPU) to Grids. Journal of Supercomputing, 2011. 57(2): p. 151-160.

37. Ying, Z.G., et al., Parallel fuzzy connected image segmentation on GPU. Medical Physics, 2011. 38(7): p. 4365-4371.

38. Huang, M.C., F. Liu, and E.H. Wu, A GPU-based matting Laplacian solver for high resolution image matting. Visual Computer, 2010. 26(6-8): p. 943-950.

39. Shams, R., et al., A Survey of Medical Image Registration on Multicore and the GPU. Ieee Signal Processing Magazine, 2010. 27(2): p. 50-60.

40. Abramov, A., et al., Real-Time Image Segmentation on a GPU. Facing the Multicore-Challenge: Aspects of New Paradigms and Technologies in Parallel Computing, 2010. 6310: p. 131-142.

41. Singhal, N., I.K. Park, and S. Cho, Implementation and Optimization of Image Processing Algorithms on Handheld Gpu. 2010 Ieee International Conference on Image Processing, 2010: p. 4481-4484.

42. Zhang, N., J.L. Wang, and Y.S. Chen, Image Parallel Processing Based on GPU. 2nd Ieee International Conference on Advanced Computer Control (Icacc 2010), Vol. 3, 2010: p. 367-370.

43. Herout, A., et al., GP-GPU Implementation of the "Local Rank Differences" Image Feature. Computer Vision and Graphics, 2009. 5337: p. 380-390.

44. Allusse, Y., et al., GpuCV: A GPU-Accelerated Framework for Image Processing and Computer Vision. Advances in Visual Computing, Pt II, Proceedings, 2008. 5359: p. 430-439.

45. Xu, Z.P. and W.B. Xu, GPU in texture image processing. Dcabes 2006 Proceedings, Vols 1 and 2, 2006: p. 380-383.

46. Schmeisser, M., et al., Parallel, distributed and GPU computing technologies in single-particle electron microscopy. Acta Crystallographica Section D-Biological Crystallography, 2009. 65: p. 659-671.

47. Crookes, D., et al., Gpu Implementation of Map-Mrf for Microscopy Imagery Segmentation. 2009 Ieee International Symposium on Biomedical Imaging: From Nano to Macro, Vols 1 and 2, 2009: p. 526-529.

48. Tan, G.M., et al., Single-particle 3D Reconstruction from Cryo-Electron Microscopy Images on GPU. Ics'09: Proceedings of the 2009 Acm Sigarch International Conference on Supercomputing, 2009: p. 380-389.

49. Hartley, T.D.R., et al., Investigating the Use of GPU-Accelerated Nodes for SAR Image Formation. 2009 Ieee International Conference on Cluster Computing and Workshops, 2009: p. 663-670.

50. Ruiz, A., et al., Pathological image analysis using the GPU: Stroma classification for neuroblastoma. 2007 Ieee International Conference on Bioinformatics and Biomedicine, Proceedings, 2007: p. 78-85.

51. Teodoro, G., et al., Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications. 2009 Ieee International Conference on Cluster Computing and Workshops, 2009: p. 437-446.

52. Coutinho, B.R., et al. Profiling general purpose gpu applications. in Computer Architecture and High Performance Computing, 2009. SBAC-PAD'09. 21st International Symposium on. 2009. IEEE.

53. Saltz, J.H., et al., Feature-based analysis of large-scale spatio-temporal sensor data on hybrid architectures. International Journal of High Performance Computing Applications, 2013. 27(3): p. 263-272.

54. Teodoro, G., Efficient Execution of Dataflows on Parallel and Heterogeneous Environments. Distributed Computing Innovations for Business, Engineering, and Science, 2012: p. 1.

55. Teodoro, G., et al. Run-time optimizations for replicated dataflows on heterogeneous environments. in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. 2010. ACM.

56. Teodoro, G., et al., Optimizing dataflow applications on heterogeneous environments. Cluster Computing, 2012. 15(2): p. 125-144.

57. Teodoro, G., et al. Accelerating large scale image analyses on parallel, CPU-GPU equipped systems. in Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International. 2012. IEEE.

58. Teodoro, G., et al., A Fast Parallel Implementation of Queue-based Morphological Reconstruction using GPUs. Technical Report, CCI-TR-2012-2, Center for Comprehensive Informatics, Emory University., 2012.

59. Teodoro, G., et al., High-throughput execution of hierarchical analysis pipelines on hybrid cluster platforms. arXiv preprint arXiv:1209.3332, 2012.

60. Çatalyürek, Ü.V., et al., Data Flow Frameworks for Emerging Heterogeneous Architectures and Their Application to Biomedicine, 2010.

61. Goldberg, I.G., et al., The Open Microscopy Environment (OME) Data Model and XML file: open tools for informatics and quantitative analysis in biological imaging. Genome Biol, 2005. 6(5): p. R47.

62. Foran, D.J., et al., ImageMiner: a software system for comparative analysis of tissue microarrays using content-based image retrieval, high-performance computing, and grid technology. Journal of the American Medical Informatics Association : JAMIA, 2011. 18(4): p. 403-15.

63. Wang, F., et al., A data model and database for high-resolution pathology analytical image informatics. Journal of pathology informatics, 2011. 2: p. 32.

64. Martone, M.E., et al., A cell-centered database for electron tomographic data. J Struct Biol, 2002. 138(1-2): p. 145-55.

65. Martone, M.E., et al., The cell-centered database: a database for multiscale structural and protein localization data from light and electron microscopy. Neuroinformatics, 2003. 1(4): p. 379-95.

66. DICOM. Digital Imaging and Communications in Medicine. 2011 [cited 2011 May]; Available from: http://medical.nema.org/.

67. Kumar, V., et al., Architectural Implications for Spatial Object Association Algorithms. the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 09), Rome, Italy, 2009.

68. Gray, J., M. Nieto-Santisteban, and A. Szalay, The zones algorithm for finding points-near-a-point or cross-matching spatial datasets. The ACM Computing Research Repository (CoRR), abs/cs/0701171, 2007.

69. Becla, J., et al., Organizing the extremely large LSST database for real-time astronomical processing. 17th Annual Astronomical Data Analysis Software and Systems Conference (ADASS 2007), London, England., 2007.

70. Dean, J. and S. Ghemawat, MapReduce: Simplified data processing on large clusters. USENIX Association Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDE '04), 2004: p. 137-149.

71. Wang, F., et al., Hadoop-GIS: A High Performance Query System for Analytical Medical Imaging with MapReduce. Technical Report, CCI-TR-2011-3, Center for Comprehensive Informatics, Emory University. , 2011.

72. Shvachko, K., et al., The Hadoop Distributed File System. 2010 Ieee 26th Symposium on Mass Storage Systems and Technologies (Msst), 2010.

73. Thusoo, A., et al., Hive - A Petabyte Scale Data Warehouse Using Hadoop. 26th International Conference on Data Engineering Icde 2010, 2010: p. 996-1005.

74. Wilkinson, K., et al., Efficient RDF storage and retrieval in Jena2. Proceedings of VLDB Workshop on Semantic Web and Databases, 2003: p. 131-150.

75. Broekstra, J., A. Kampman, and F. van Harmelen, Sesame: A generic architecture for storing and querying RDF and RDF schema. International Semantic Web Conference, Lecture Notes in Computer Science, 2002(2342): p. 54--68.

76. Kiryakov, A., D. Ognyanov, and D. Manov, OWLIM - A pragmatic semantic repository for OWL. WISE Workshops, volume 3807 of Lecture Notes in Computer Science, 2005: p. 182-192.

77. Hill, E.F., Jess in Action: Java Rule-Based Systems. 2003: Manning Publications Co., Greenwich, CT, USA.

78. Narayanan, S., et al., Parallel Materialization of Large ABoxes. The 24th Annual ACM Symposium on Applied Computing (SAC 2009), Hawaii, USA, March 2009.

79. Kumar, V., et al., Analysis and Semantic Querying in Large Biomedical Image Datasets.

IEEE Computer Magazine, special issue on Data-Intensive Computing, 2008. 41(4): p. 52-59.