Information Considerations - Management of Enterprise Data


Transcript of Information Considerations - Management of Enterprise Data

CRF-RDTE-TR-20100202-08

11/2/2009

Public Distribution | Michael Corsello, Corsello Research Foundation


Abstract

Information management is a complex topic involving not just information technology, but business processes and staffing as well. Enterprise Information Management includes every aspect of the creation, storage, use, disposal of, and accountability for every piece of information an organization comes in contact with.


Table of Contents

Abstract
Introduction
Audiences
Business Domains
Data Categories
    Raw Data
        Accuracy and Precision
        Faults and Blunders
        Data Capture
        Human Capture
        Sensor Capture
    Derived Data
Data Uses
Data Structure Types
    Unstructured (Documents)
    Semi-Structured
    Structured
    Structure Element Types
        Textual
        Tabular
        Graph Data
        Spatial
        Temporal
Data Formats
    File Formats
    Stability
    Longevity
        Data
        Storage
        Format
        Media
Data Volumes
    Separation
    Sharding
    Versioning
    Archival
    Evaporation
Conclusions
Appendices
    Acronym List
    References


Introduction

Organizational management of information is a long-standing issue that has not been solved by technology. As technology advances, capabilities to generate data and capacities to store it continue to grow. This information glut presents problems to every organization that creates or uses information. The storage of data is a multi-dimensional problem that must be considered from many aspects:

- Audiences, both intended and unintended
- Business domains, including legal and policy restrictions
- Data categories, such as sensor feeds, field sampling and analytic results
- Data uses, including activities such as data capture, analysis and reporting
- Data structure types, such as document, tabular, raster, spatial or temporal
- Data formats, including proprietary formats that may restrict use
- Data volume over time, which drives the infrastructure needed to store data

Each of these aspects alone provides only a partial set of demands and restrictions on the handling of data for the enterprise. Given that an organization "in the large" is both an enterprise and a part of larger enterprises that include all other organizations in a similar business domain, it is imperative that the organization's information strategy meet both the short- and long-term demands of the enterprises the organization interacts with.

Audiences

Information is useless without an audience. Even in automated processing and control systems, an audience receives the result of the information processing. The benefit to the audience is the purpose of the information. There are two primary categories of audience when dealing with information in any form: intended and unintended.

The intended audience is divided further into two categories, direct and indirect. The direct intended audience is the portion of the audience using an information store which that store was designed to directly support; this is the primary user base of the information. The indirect intended audience is the set of users (and domains) that were planned for during information store design. This audience will have varying levels of success in using the information and may or may not be an actual set of users (there may be no people in this set who actually use the information). It is not uncommon that the indirect intended audience amounts to only a small portion of the user base and may be incorrectly targeted during the design phase.

The unintended audience is the set of information users which the information store was not planned or designed for. The unintended audience is likewise divided into two categories, direct and indirect.

Figure 1. Top-Level Categories of Users


The direct unintended audience is the portion of the audience for an information store that is within a business domain or organization that correlates well with an intended group. The direct unintended audience will generally meet with moderately good success in information usage, the primary limitation being access to the information. Finally, the indirect unintended audience is the set of users that are not related to an expected domain. This form of user may include orthogonal business areas (such as a construction firm using business demographic data for site selection) and may gain significant benefits from information usage. In fact, it is entirely possible that this group of users may outnumber and out-benefit all other user groups. Since this group is not intended, they are the most difficult to support. These groups may have any form of software application and require data in specific formats. Unfortunately, data may be provided in incompatible formats, which greatly hinders reuse. While open data standards (such as XML schemas) and web services help cross-platform information exchange, they do not solve the problem of data formats for specific applications.

Business Domains

A business domain is a business functional area that has a related goal and uses similar information. A single organizational group may operate in multiple business domains and therefore have information stores that overlap those multiple business domains.

A business domain such as hydraulic engineering, geodetic survey, ecological monitoring or commercial fishing will produce and use information that overlaps with other domains. In fact, the specific domains mentioned here all overlap in significant ways. For example, all of the above domains utilize information about fixed monuments for relative locations, which are the heart of surveying. Likewise, all of these domains will have use for information about locations in general. Further, these groups will all use information on projects: all of the domains other than commercial fishing could have direct efforts on the construction of a lock, dam or channel, and the commercial fishing domain may then be an unintended audience utilizing that data for planning trawler capacities and travel schedules for safe cargo passage. In this case, the commercial fishing domain is simply acting as a waterway cargo hauling domain.

Figure 4. Business Domains with Commonality

Figure 2. Types of Intended Users

Figure 3. Relative Magnitude of User Types


The ultimate consideration for business domains is information structure, use and access. These are the themes that run through the remainder of this paper to describe the "what", "where", "when", "why" and "how" of interacting with the information produced by an organization. One of the most important considerations with business domains is the isolation of domains from one another in space and time. In general, each organization will operate at specific locations and will produce and maintain information for a time period that is relevant to that organization. Policy or legal issues relating to regulated information (such as personally identifiable information (PII)) may drive organizational isolation. Also, each organization will isolate its own information technology infrastructure from the global internet for security purposes. Security and auditing considerations must be addressed when information must flow across these boundaries.

Data Categories

First, a separation between data and information must be established. Information is the collection of data, tools and context which a human uses to derive knowledge. Data is the collection of stored values that serves as information when presented in context. Data may imply context based upon a storage model or management tool (such as a software application); however, data can be moved between models to alter or infer other contexts.

Each category of information that the organization produces or manages will affect the information modeling for that organization. Data categories include raw data sources, such as sensors and field collection, and processed sources, such as analysis results. A data category will influence the model used to store that information and may greatly affect the theoretical rate of production for that data.

Raw Data

Raw data includes all categories of data collected directly by observation of a physical phenomenon in the real world. Raw data is the original form of data from a source and must be considered the basis for every data set derived from that source. Raw data is generally collected in higher volume than derived data and has the greatest potential for reuse of any form of data.

Figure 5. Data Domains Shared Across Business Domains


Prior to detailing categories of raw data, it is important to appreciate the sources of error and incompatibility between data sets. Errors in data fall into two primary areas: measurement error and introduced error. Measurement error takes the form of the accuracy and precision of the measurement made. Introduced error takes the form of faults and blunders in data collection and entry, generally associated with human interpretation and capture.

Accuracy and Precision

Accuracy and precision are commonly confused in spite of their relative simplicity. Accuracy is the absolute measure of how close a measurement is to the actual value; it is often described as "how close a dart is to the bull's-eye". Precision is the repeatability of a measure: the closeness of sequential measurements to each other, irrespective of the actual value being measured. In shooting terms, precision is measured as the "size of the group" for sequential shots. It is therefore quite possible to have low accuracy (missed the target by 12 inches) and high precision (95% of all shots were within 0.0001 inches of each other).
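
As a minimal numeric sketch of the distinction (in Python, with hypothetical readings), accuracy can be expressed as the offset of the mean measurement from the known true value, and precision as the spread of the measurements around their own mean:

    import statistics

    true_value = 100.0                       # known reference value
    readings = [103.1, 103.2, 103.0, 103.1]  # hypothetical sensor readings

    mean_reading = statistics.mean(readings)

    # Accuracy: how far the measurements sit from the actual value.
    accuracy_error = mean_reading - true_value     # ~ +3.1 (poor accuracy)

    # Precision: how tightly the measurements cluster, regardless of truth.
    precision_spread = statistics.stdev(readings)  # ~ 0.08 (high precision)

    print(f"accuracy error: {accuracy_error:+.2f}")
    print(f"precision (std dev): {precision_spread:.3f}")

Here the instrument is precise but inaccurate, the "tight group far from the bull's-eye" case described above.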

In general, sensors are calibrated for accuracy and have intrinsic precision properties that may not be adjustable over the lifespan of the sensor. In many cases, a loss of precision in a sensor is the indication that the sensor is at the end of its effective life. Sensors with reduced accuracy may introduce a systematic error, which can potentially be corrected by adding an "error offset".

Accuracy and precision are of concern in both human and sensor measurements. Human measurements tend to be more stochastic in terms of both accuracy and precision, but are not necessarily of less value.

Generic Accuracy

In general terms, accuracy should be recorded for all raw data collected, regardless of source or data type. These generic accuracy statements are metadata (data about data) attached to all data collected under a similar generic accuracy. Including an accuracy statement with every raw data set ensures that data users can determine the maximum level to which derived data could be considered accurate: no derived data set can be more accurate than the least accurate source data used to derive the result. This has profound implications in analysis, where partial data sets must be integrated to form a complete picture from which analytical results are derived.
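
A minimal sketch of that rule, assuming a simple dictionary-based metadata convention (the field names are illustrative, not from this report): the accuracy bound attached to a derived product is the worst bound among its sources.

    # Hypothetical accuracy metadata: error bounds in meters.
    sources = [
        {"name": "survey_points",  "accuracy_m": 0.05},
        {"name": "aerial_imagery", "accuracy_m": 1.50},
        {"name": "field_notes",    "accuracy_m": 5.00},
    ]

    # No derived data set can be more accurate than its least accurate source.
    derived = {
        "name": "site_analysis",
        "accuracy_m": max(s["accuracy_m"] for s in sources),  # 5.00
        "derived_from": [s["name"] for s in sources],         # lineage
    }

    print(derived)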

Faults and Blunders

When a data value is not captured correctly, it may be due to a fault or a blunder. A fault is the capture of a correct measure or value in an incorrect manner, such as transposing latitude and longitude. A blunder is the capture of an incorrect value, either by incorrect measurement or by incorrect entry. Incorrect measurement is a gap between the physical environment and the data system (such as a reading from a GPS on a tripod that is not leveled properly). Incorrect entry is a gap between the source data and a destination data system (such as entering a value into a database or notebook with its digits transposed).

Faults and blunders are generally caused by humans and may exist anywhere in the data lifecycle. They are random and not easily detected or corrected.
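
Some faults can at least be flagged by automated range checks. The hypothetical validation helper below catches the transposed latitude/longitude fault whenever the swap pushes a value out of range; it cannot catch swaps where both values remain in range, which illustrates why such errors are not easily detected.

    def check_coordinate(lat: float, lon: float) -> list[str]:
        """Flag out-of-range coordinates, including a likely lat/lon swap."""
        problems = []
        if not -90.0 <= lat <= 90.0:
            if -90.0 <= lon <= 90.0 and -180.0 <= lat <= 180.0:
                problems.append("latitude out of range; values may be transposed")
            else:
                problems.append("latitude out of range")
        if not -180.0 <= lon <= 180.0:
            problems.append("longitude out of range")
        return problems

    # A point in Alaska entered with latitude and longitude swapped:
    print(check_coordinate(-149.9, 61.2))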


Data Capture

Raw data may be captured by humans or by automated means such as sensors. When sensors are used, they may perform direct electronic capture or require a human to record and transfer the sensed data. Any data capture that requires a human to record the data is classed as human capture, equal to all other human-captured data.

Regardless of the mechanism of capture, raw data should be considered "gospel" data that must be retained for the longest period of time. If the raw data used in an analysis is lost, that analysis cannot be verified and its validity is ultimately vulnerable to scrutiny. Unfortunately, the volume of raw data may be prohibitive for traditional long-term storage. In that case, a data persistence mechanism is required that supports long-term retention for the lifetime of all analysis products.

Human Capture

Human data capture is the practice of a person measuring, recording and entering values into a data set. Human capture is still a common form of data generation and includes all business data entry systems such as "online forms" and general "spreadsheets". Since a human is collecting the data and entering it into a system, the rate of generation is limited by the number of people capturing and entering data and the absolute rate at which a single person can generate the data. Human capture systems are generally low- to medium-volume capture sources.

Sensor Capture

Sensor systems can be direct capture or indirect capture. A direct capture system records all sensed data directly to a persistent store such as a database. Direct capture systems often include sensors such as:

- Fixed thermal sensors (weather stations)
- Closed-circuit cameras
- Supervisory Control and Data Acquisition (SCADA) systems
- Hardware self-monitoring
- Software logging / auditing

Indirect capture systems include all disconnected sensor systems that require periodic data transfers, such as field data loggers. Indirect capture systems include sensors such as:

- Hydrolab Datasondes
- Hardware diagnostic sensors (such as those in automobiles)
- GPS units
- Autonomous hardware devices (such as robots, transponder readers / loggers, etc.)

Sensor capture devices tend to provide reliable data if properly used and maintained. Due to their automated nature, these devices can have low to extremely high data capture rates. Indirect capture devices are limited in capture rate by the transfer requirement, but this can be overcome by exchanging devices.


Derived Data

All data products created from existing data are considered derived. Derived data will often be smaller in size and of greater value to a user than the source data used to produce it. Derived data may be produced dynamically, "on demand" (such as complex queries over a relational data store), or statically produced and persisted (such as analysis results and formal reports). Whenever possible, metadata should be captured and maintained for all derived data, indicating its chain of production and all data sets used in the derivation of the product.
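
A minimal sketch of such a chain-of-production record (the field names are illustrative; a dataclass is just one convenient container):

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class ProvenanceRecord:
        """Chain-of-production metadata for a derived data product."""
        product: str
        process: str       # tool or analysis run that produced the product
        inputs: list[str]  # every source data set used in the derivation
        produced_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat()
        )

    record = ProvenanceRecord(
        product="flood_risk_report",
        process="hydraulic_model run 14",
        inputs=["gauge_readings_raw", "channel_survey_2008", "rainfall_grid"],
    )
    print(record)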

Data Uses

Each user will have a set of uses for a data set. The basic forms of data use include capture, view, analysis (derivation) and reporting (derivation). The initial use of a data set is data capture or generation, the process by which users produce and record the raw data.

Data view (reading or retrieval) is the practice of physically viewing, querying, browsing or enumerating a data set. Viewing is the most common use of data and is subject to availability, discovery and transfer constraints. The user context may dictate the presentation format of the data and may therefore require a transformation of the data from its storage format.

Analysis is the act of deriving a data set from one or more other data sets. Analysis often requires data to be in a specific structure or format amenable to the type of analysis to be performed. Additionally, data volume may limit the ability to transform the data due to storage limitations. Finally, analysis processes may require data transfer at specific rates or with specific access methodologies (such as random or sequential access) that drive the data storage environment (such as server sizing). Analysis tends to be the most performance-critical use of data and may significantly affect costs and planning.

Reporting is the act of deriving an output data set from an existing data set. Reporting generally adds minimal "new" information (typically just aggregation functions such as averages or sums), but tends to add significant value through improved information context and comprehension. Reporting may be automated (scheduled or triggered) or manual, and may involve data transformation and additional storage (such as in an online analytic processing (OLAP) system).


Figure 6. Relative size of user communities

All forms of data use will influence storage methodologies, environment scaling and access control. Integration of external application systems can be the most significant cost, in time and money, of architecting an effective information strategy. It is also important to understand the scale of the user communities. In any information system, the largest proportion of users is viewers, while the smallest proportion is creators. Given this pyramid (as depicted above), it is clear that data creation is targeted at a small set of users with specialized needs. Due to the small size of that community, it is reasonable to invest greater funds per user to create robust data collection tools. If the smaller communities are not emphasized, the data available to all users will be of lesser quality. In general, it is a more effective use of resources to invest in collection and analysis tools, with general-purpose viewing tools created as needed.

Data Structure Types

All data must be structured in a manner that a computer can handle. These structures abstract the physical world to provide the information capability the organization requires. Data structure types fall into three primary levels: unstructured, semi-structured and structured.

Unstructured (Documents)

Unstructured data is most commonly "documents". These data types store values with embedded meaning but an absence of structure for conveying that meaning. A document can be understood by reading it and may have many meanings depending upon the reader. Since there is no pre-defined structure to the content, a computer treats the content as a single atomic data item. The content within the document may be mined to acquire limited information (generally metadata) beyond that entered by the creator. Unstructured data is of limited use within a data set, but provides immense use to human users. Unstructured data is primarily managed in an information store as a document repository, flagged with metadata relating the document to other data.


Semi-Structured

Semi-structured data is data containing a limited structural definition. XML is a standard encoding for data that may be externally structured via an XML Schema. Semi-structured data may be fully conformant with a single structural definition, or it may be conformant in part with multiple structural definitions. Semi-structured data will be a consideration for implementation use and for the consumption of externally provided data.
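
As a brief illustration, the sketch below tests whether an XML document conforms to an XML Schema, here using the third-party lxml library (one common choice; the file names are placeholders):

    from lxml import etree  # third-party package: pip install lxml

    # Load the structural definition (XML Schema) and the data to test.
    schema = etree.XMLSchema(etree.parse("observation.xsd"))
    document = etree.parse("observation.xml")

    # Fully conformant data validates cleanly; semi-structured data may not.
    if schema.validate(document):
        print("document conforms to the schema")
    else:
        for error in schema.error_log:
            print(error.message)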

Structured

Structured data is all data stored under a pre-defined, well-known structure. Relational databases and object-oriented classes define structured data. Structured data is the most powerful aspect of data storage and modeling, as it conveys structure and implies meaning of and between data elements.

In most cases, an analysis will require specific data in a specific structure, but will not semantically comprehend the data used. Structured storage and transfer of data does not imply meaning or guarantee correctness. All data persisted within an organization repository should be in structured storage unless otherwise required; semi-structured and unstructured data may be stored within a structured repository.

Structure Element Types

Within a structured or semi-structured store, each data element will be of some structured type. Primary data types include Boolean (true/false), integral numeric, real numeric, text, binary and abstract types. Abstract types are constructed as a defined collection of the other primary data types. All primary data types are supported in some form in every language, database and general information system.

Textual

A textual data element is any free-form text entity that is interpreted as a whole. In general, a document can be considered primarily a textual entity. Note that specific document encoding formats (such as the Microsoft Word format) may include far more than text and may depend upon a specific tool to read data in that format. General-purpose textual information (including HTML and XML files) is encoded in a standard form that is easily understood on any platform.

Tabular

Tabular data elements are any data structured as rows and columns of values. Object-oriented classes are also considered a tabular structure, in that a class is analogous to a table structure. The key concept of tabular data is a pre-defined structure for all instances of data conforming to that structure. Any data structure that is a two-dimensional array of values is a table. All digital images may be considered tabular data in their raw form; it is only their encoding (such as JPEG or TIFF) that makes them distinct elements. Further, a tabular data structure may store any other form of data within a single column of that structure.


Graph Data

Graph data elements are any data structured upon relationships between elements. This may take the form of hierarchies (trees) or networks (graphs). A graph G is a set of nodes N and edges E, written G(N, E), where each edge e in E is a pair of nodes (n1, n2), both in N. Edges may be directed (source to destination) or bi-directional (considered "undirected"). A social network like Facebook is an undirected graph connecting people (nodes) who indicate they know each other (edges). A road or river network is also an example of a graph.

Graph data may also be tabular, in that the nodes and edges may each be represented by a table. It is also possible for a set of nodes to be shared across multiple graphs, each with a distinct set of edges. Graph data is conceptually simple, but is potentially complex and may be used to support any number of complex analytical routines. In general, anything representing connectivity can be considered a graph. Most commercially available data management tools, including relational databases, have poor support for graph data. Further, most commercial tools that do support graph data expect the entire graph to be in memory, which is a significant limitation for large data sets. As an example, the US road network has over 30,000,000 edges (road segments) and an approximately equal number of nodes (intersections).
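
A minimal sketch of the tabular representation described above: nodes and edges as two small "tables", with a breadth-first traversal built over them (the place names are illustrative).

    from collections import defaultdict, deque

    # Node and edge tables: each row is a tuple, as in a relational store.
    nodes = [(1, "Depot"), (2, "Lock A"), (3, "Dam B"), (4, "Channel C")]
    edges = [(1, 2), (2, 3), (3, 4)]  # pairs (n1, n2)

    # Build an adjacency index from the edge table.
    adjacent = defaultdict(set)
    for n1, n2 in edges:
        adjacent[n1].add(n2)
        adjacent[n2].add(n1)  # undirected: store both directions

    def reachable(start: int) -> set[int]:
        """Breadth-first search: every node connected to 'start'."""
        seen, queue = {start}, deque([start])
        while queue:
            for neighbor in adjacent[queue.popleft()]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    print(reachable(1))  # {1, 2, 3, 4}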

Spatial

Data regarding a location or a graphical depiction (diagram or map) may be considered spatial data. Spatial data generically consists of geometries (shapes) as an attribute (column) of a structured data set. If the geometries represent physical locations on the Earth or another celestial body, those geometries are considered geospatial. A geographic information system (GIS) is one form of repository for storing geospatial data.

Coordinate Systems

Simple geometry data is based upon a standard Cartesian plane, such as a drawing program (e.g. Microsoft Visio or the open-source Dia) would support. If the geometries are geospatial, there may be a planetary coordinate system that allows for accurate representation on an ellipsoidal model of the planet. The coordinate systems of geospatial geometries are mathematically complex and distort aspects of the geometries in specific ways depending upon the coordinate system selected. As data is collected, it exists in a single coordinate system.

As data is transformed into another coordinate system, distortion occurs. Any analysis performed using data from different coordinate systems will contain a measure of error (or bias) based upon artifacts introduced by the distortion.

Spatial Accuracy, Precision and Scale

The accuracy and precision of spatial data have the added aspect of vertex density. A geometry such as a line is composed of multiple points (vertices) connected by segments. This "chaining" of vertices allows for the generation of complex lines (polylines or arcs) and polygons. When the geometry is created (digitized), the spacing between the vertices may be significant. Each vertex has accuracy and precision measures, as does the space between the vertices. For example, a line consisting of three points may be accurate and precise to within one inch at the vertices, with no measure of accuracy or precision for the linear segments connecting the vertices.

Spatial scale is the ratio of "real world" distance to the depicted distance of a geometry. For geospatial data, the coordinates stored are often "exact" within the precision of the data element (such as a 64-bit floating point number). In terms of scale, the defined geospatial scale is the scale at which the data was collected for use. This scale indicates the nominal distance between vertices used to depict variation in location (such as the curves in a stream). As the scale becomes finer (tends toward 1/1) there will be more vertices per unit distance to indicate variation; as the scale becomes coarser (tends toward 1/∞) there will be fewer. As a result, coarse-scale data will depict less of the variation in a road or river as it meanders. Further, at a coarser scale the geometry will diverge from the physical world more between the vertices than at a finer scale. When spatial data is transformed to another coordinate system, this may cause great concern, as the linear segments between vertices cannot distort (curve) with the coordinate system unless interpolated vertices are inserted. When vertices are added to introduce this curve, the original data is lost and the result may appear more fine-scale than the source data actually was. This is a serious issue that must be addressed for all spatial data.
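
A minimal pure-Python sketch of the interpolation step described above: inserting evenly spaced vertices into each segment before reprojection lets the line curve with the target coordinate system (the transformation itself is omitted, and the three-point line is illustrative).

    def densify(vertices, parts=4):
        """Insert linearly interpolated vertices between each original pair."""
        out = []
        for (x1, y1), (x2, y2) in zip(vertices, vertices[1:]):
            for i in range(parts):
                t = i / parts
                out.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
        out.append(vertices[-1])  # keep the final original vertex
        return out

    line = [(0.0, 0.0), (10.0, 0.0), (10.0, 5.0)]  # three measured vertices
    dense = densify(line)
    print(len(line), "->", len(dense))  # 3 -> 9 vertices

    # Caution, per the text above: the interpolated vertices are artifacts,
    # not measured positions, and can make the data look finer than it is.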

Temporal

Data describing the real world is always valid in time. Time can be considered a one-dimensional space that is infinite in both directions. Like spatial data, time is an attribute of other data structures such as tabular data. Temporal data can be categorized into three primary classes: archival, historical and true temporal.

Archival

Archival data is not truly a temporal aspect of data, but rather of data handling. Archival is the practice of removing "old" data from the primary operational data store to conserve space and improve performance. Archival may be ad hoc or scheduled. Archived data is stored separately (even if only logically) from operational data and requires additional processing to query. Archival data may be coupled with temporal data for additional capability.

Historical

Historical data is a form of temporal data that persists data record changes over time. This is often accomplished via a start date and end date on "old" records of a specific entity, and is commonly coupled with auditing for transactional histories of data. The historical records may be queried to provide a level of detail on how an entity has changed over time. In historical data the temporal aspect is secondary: all data is operated upon in the present sense, and historic information is merely available for viewing. History data may be archived to free space.

Temporal

Truly temporal data also contains the notion of a start and end date, but adds the concept of time as a first-class data element. A temporal data entity may be edited or created in the present, past or future. For example, a bridge may be planned for construction in the future and entered into the system for that future time. If the bridges are queried in the present, the future bridge will not appear. Likewise, a bridge may be scheduled for demolition and therefore be indicated as demolished in the future; if the system is queried at a time after the demolition, the bridge will not appear. Finally, after the bridge is destroyed, it may be rebuilt further in the future. Once rebuilt, the original bridge record may be "resurrected" as existing once again (and considered the same bridge). This allows for a fully temporally aware data system that can be acted upon in time. Temporal data may be archived to free space and may be coupled with spatial data for full four-dimensional space and time awareness.
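
A minimal sketch of the bridge example, assuming each record carries a validity interval (the valid_from/valid_to convention here is illustrative): an "as of" query returns only records whose interval contains the query time.

    from datetime import date

    # Each row: (bridge, valid_from, valid_to); None means open-ended.
    records = [
        ("Bridge 7", date(1960, 5, 1), date(2012, 9, 30)),  # demolition planned
        ("Bridge 7", date(2015, 3, 1), None),               # planned rebuild
    ]

    def as_of(when: date) -> list[str]:
        """Return the bridges that exist at the given point in time."""
        return [
            name for name, start, end in records
            if start <= when and (end is None or when < end)
        ]

    print(as_of(date(2009, 11, 2)))  # ['Bridge 7'] - exists in the present
    print(as_of(date(2013, 6, 1)))   # []           - demolished, not yet rebuilt
    print(as_of(date(2020, 1, 1)))   # ['Bridge 7'] - record "resurrected"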

Data Formats

All data stored within a computer is represented as files; even relational databases store their data in files. An important aspect of data management is the format of the data, both in terms of data models and file formats. A data model is the logical decomposition of an entity into the properties required to describe that entity. These data models are then persisted in some manner within files. For example, the Microsoft Word application saves documents in a ".doc" or ".docx" file, each of which has a specific encoding of the data contained within the file.

File Formats

Data file formats can be open or closed. An open format is a well-known and well-documented format, such as a text file (regardless of content), an ESRI Shapefile or an Autodesk "dxf" file. These formats are well-defined and publicly published; anyone can develop an application that extracts the information from a file conforming to them. A closed (proprietary) format is any file format that is not well-known or well-documented. The Oracle ".ora" and MS Word ".doc" file formats are examples of closed formats.

Stability

File formats change over time as new versions are released. In general, as we embrace change, changes to file formats must follow. Changes in file formats, whether open or closed, affect all data currently persisted. The more frequently a format is revised by its definers, the less stable the format is. Data stored in a specific version of a format becomes more difficult to maintain as the format is revised over time.

Longevity

Longevity is important to information strategies in four primary respects:

- Data longevity
- Storage longevity (management)
- Format longevity (stability)
- Media longevity


Data

Data longevity relates to the period of time that a given piece of information is relevant and valid. In general, data is perpetual. A data element captured in time is a stable representation of that element at that point in time, for all time. As time progresses, data captured in the past maintains its representation of the element at the point in time it was captured. If any form of temporal change determination is to remain possible, or for primary historical purposes, data must be maintained in perpetuity. Only a small portion of raw information is truly transient in nature, whereas many derived datasets are highly volatile in time.

When data longevity is considered, the applicable maintainable lifetime must be identified per data element. For that time period, the data must be persisted and protected.

Storage

Storage longevity is the natural follow-up to data longevity, under which the mechanisms for storage are maintained. Storage longevity deals with the planning and maintenance of the infrastructure and applications that hold the data to be stored, including hardware replacement and archival plans for moving data to secondary stores.

An important consideration in storage longevity is failure mitigation and recovery: if a device fails, how will the data be restored to operational capability? Additionally, if third-party storage is used (such as cloud storage), storage longevity should include consideration of inter-organization practices and policies to ensure storage stability.

Format

Format longevity directly relates to the stability of the selected storage formats. Further, format longevity includes migration of data from one format to another as formats are upgraded, or as format selections change (such as a migration from one RDBMS to another). Format longevity may be critical in terms of its effect on the chain of custody of the data. As a format is changed, it may alter the content (e.g. floating point representations may differ, resulting in precision changes), or simply result in a "non-original", "official" copy of the content. This is a critical issue for the National Archives and Records Administration (NARA).

Media

Media longevity deals with the effective lifespan of physical storage media. For example, the lifespan of a compact disc (CD) is estimated at anywhere from 2-10 years or more (http://www.archives.gov/records-mgmt/initiatives/temp-opmedia-faq.html). The actual lifespan of physical media is most critical for offline or extended storage that is not used frequently. Active data media will be replaced during storage maintenance, but media longevity is often unaddressed except during long-term archival planning (again, a critical NARA issue).

Data Volumes

The volume of data collected per unit time for a specific data entity drives much of the planning for data storage. In high-volume data collection activities, long-term storage is only practical using low-cost offline media such as DVD, tape or Blu-ray. The longevity of this media then becomes an issue, so the media may need to be tested and replicated periodically to ensure the stability of the content. Even for mid- to low-density data collection, if the data is temporal with a high change rate (such as product inventory), accumulation of records may result in an overall higher-than-anticipated volume. Finally, aggregation of multiple data entities can also result in large data volumes. Scaling information solutions to handle large data volumes is a complex undertaking that is still in its infancy. Scaling may include any of several approaches:

- Separation: keep each data entity type in an isolated store
- Sharding: keep natural subsets of a data entity type in isolated stores
- Versioning: keep only changes to data entities over time
- Archival: move "older" data offline to secondary stores
- Evaporation: delete "out of date" data automatically

Higher storage densities are also possible using existing technologies such as compression, efficient encoding (e.g. JPEG vs BMP for images) and simple removal of redundancy.
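
As a quick illustration of the density gains available from compression alone, the sketch below compresses a deliberately redundant byte sequence with Python's standard zlib module (the sample data is contrived; real ratios depend entirely on the data):

    import zlib

    # Contrived, highly redundant sample: repeated sensor-style records.
    raw = b"site=042,temp=21.5,unit=C;" * 10_000

    compressed = zlib.compress(raw, level=9)
    print(f"raw: {len(raw):,} bytes, compressed: {len(compressed):,} bytes")

    assert zlib.decompress(compressed) == raw  # lossless round trip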

Separation

Data separation encompasses any means of separating data in a natural manner by keeping unrelated information separate from other information. Service-oriented architectures (SOA) and cloud computing in general strongly advocate data separation.

Separation may make applications more complex, as each type of data is stored in isolation from other data. When it is necessary to query across data entity types, multiple data stores must be queried and all result sets related based upon commonality. Further, there are data editing anomalies that must be addressed.

Sharding

The notion of sharding is related to that of separation, with the exception that a single data entity type is separated into isolated stores based upon some property of the entity instances. For example, a temperature store may be separated by location, where each temperature monitoring site has its own data store but all sites use the same logical model. If temperature data is queried across all sites, then multiple parallel queries must be executed and the results unioned.

Sharding, like separation, may cause significant increases in complexity for applications consuming data from a sharded data store, due to the increased query integration complexity.
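
A minimal sketch of the fan-out-and-union pattern described above, with one in-memory "store" per monitoring site and a thread pool standing in for the parallel queries (the site data is illustrative):

    from concurrent.futures import ThreadPoolExecutor

    # One shard per monitoring site; every shard uses the same logical model.
    shards = {
        "site_a": [("2009-11-01", 14.2), ("2009-11-02", 13.8)],
        "site_b": [("2009-11-01", 17.5), ("2009-11-02", 16.9)],
        "site_c": [("2009-11-01", 9.4)],
    }

    def query_shard(site: str, day: str) -> list[tuple]:
        """Stand-in for a per-site query; returns matching rows."""
        return [(site, d, t) for d, t in shards[site] if d == day]

    # Fan the same query out to every shard in parallel, then union results.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: query_shard(s, "2009-11-01"), shards)
    union = [row for part in partials for row in part]
    print(union)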

Versioning

The versioning of data is a form of temporal storage in which only a single complete record is stored for a data entity and all temporal changes are stored as "change sets" containing only the information elements that changed. In most circumstances, data formats drive the storage mechanisms used. Since few data formats include a capability for versioning, there are few circumstances where versioning is cost-effective to implement. In most cases of temporal storage, an entire copy of the data element is stored for each temporal update made, resulting in many copies of unchanged data being persisted over time.

Versioning carries a performance cost in the retrieval of a specific change set. This cost results from integrating the changes (deltas) with the base record to arrive at the derived temporal record. The base record may be optimized to be the most commonly retrieved temporal instance (usually the most current record is the completely stored base record, resulting in a reverse temporal change set).
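
A minimal sketch of the change-set idea, with records as dictionaries and each delta holding only the fields that changed (a simplification; a real system would also track deletions and timestamps):

    # One complete base record plus ordered change sets (deltas).
    base = {"bridge": "Bridge 7", "lanes": 2, "status": "open"}
    deltas = [
        {"status": "under repair"},      # only the changed field is stored
        {"lanes": 4, "status": "open"},  # widened and reopened
    ]

    def version(base: dict, deltas: list, n: int) -> dict:
        """Reconstruct version n by applying the first n deltas to the base."""
        record = dict(base)
        for delta in deltas[:n]:
            record.update(delta)
        return record

    print(version(base, deltas, 0))  # the stored base record
    print(version(base, deltas, 2))  # latest: lanes=4, status=open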

Archival

Archival is the removal of historical records from the primary store to a secondary, higher-density but lower-cost store. Archival may improve the performance of the retained data by reducing the size of the store to be actively searched. Unfortunately, archival generally increases the complexity, cost and time of querying into the archive.

Evaporation

Evaporation (or purging) is the removal (deletion) of "old" data from the system. Evaporation is generally an automated process that "expires" old records based upon some criteria and deletes those records from the system. This may or may not operate in conjunction with archival, in which case old archives are destroyed. Evaporation cannot be undone, so it must be thoroughly planned to ensure that no useful data is lost.
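
A minimal sketch of an automated expiry rule (the seven-year retention window and field names are purely illustrative): records older than the retention cutoff are selected and deleted in one pass.

    from datetime import date, timedelta

    RETENTION = timedelta(days=365 * 7)  # hypothetical retention policy

    records = [
        {"id": 1, "captured": date(2001, 3, 14)},
        {"id": 2, "captured": date(2008, 6, 2)},
    ]

    def evaporate(records: list, today: date) -> tuple[list, int]:
        """Delete records older than the retention window; irreversible."""
        cutoff = today - RETENTION
        kept = [r for r in records if r["captured"] >= cutoff]
        return kept, len(records) - len(kept)

    records, purged = evaporate(records, date(2009, 11, 2))
    print(purged, "record(s) purged;", len(records), "retained")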


Conclusions

Every organization must evaluate its information strategy and portfolio to ensure that all information is properly maintained over an effective lifecycle for all users, both intended and unintended. There are many considerations for each form of data that comprises the organizational information corpus. Data modeling is a critical activity for ensuring that the proper entities are captured in a repeatable, standardized and maintainable manner.

Overall, data is the primary vehicle for transferring information, which in turn is used to derive knowledge. Well-designed and well-defined data will provide better information and may create opportunities for new, unplanned uses of the data without a need for new data models. These emergent capabilities may have much greater societal impact than the original uses for which the data was modeled.

Appendices

Acronym List

Acronym   Description
DID       Data Item Description
DoD       Department of Defense
DSU       Defense Systems Unit
GAO       Government Accountability Office
NARA      National Archives and Records Administration

References