The Challenges of Integrating Structured and Unstructured Data

LANDMARK TECHNICAL PAPER The Challenges of Integrating Structured and Unstructured Data By Jeffrey W. Pferd, PhD, Sr. Vice President Strategic Consulting Practice at Petris Presented at the 14th Petroleum Network Education Conference (PNEC), 2010


By Jeffrey W. Pferd, PhD, Sr. Vice President Strategic Consulting Practice at Petris¹

Introduction

All organizations are aware that a considerable amount of technical and business information and knowledge resides in both structured databases and unstructured repositories (e.g., documents and emails). Simply enabling independent searches of these does not produce the most value. Valuable conclusions are recorded in reports that were developed from investigating structured data, and these are lost from view when searching only databases. Together, the two sources provide the facts and the conclusions.

When the searches are combined, they produce a plethora of information. Usability and workflows are important design considerations to achieve clear, quick access to multiple disparate information sources. The power of combined access to integrated structured and unstructured data is becoming the focus of many international oil and gas companies.

This paper describes the technical and usability issues of integrating disparate sources into a clear and single information environment.

Background

The E&P scientific and technical business is information and data intensive. As the computerization of work has progressed, more and more of the information has migrated to electronic form. Initial data management focused on the workgroup, where each discipline dealt with its own area of expertise. The first efforts to manage data beyond the workgroup concentrated on shared structured data, such as well logs and seismic data. Interpretation results, such as horizons, picks, and faults, were shared next. Recognizing that project speed and quality would increase by sharing data, organizations made an effort to define massive multidisciplinary data models. These persist today and stand alongside the active project data stores. Most of these large-scale repositories are used to store the raw incoming information that feeds the interpretation systems.

At the same time, document management systems were growing. They started on the business side and migrated toward the scientific and technical by storing reports of interpretation. Over time, emails and the documents located on shared and personal computer disks began to hold larger and larger volumes of important and valuable information.

We did an informal web survey amongst a number of E&P data management staff and consultants on where they believed valuable information resides.

¹ Landmark Software & Services acquired Petris in 2012, at which time the author of this paper joined as Sr. Technical Advisor for Information Management.


Scientific and technical users believe that traditional repositories hold valuable information. But they also believe that the unstructured, ad hoc repositories of SharePoint and ‘wild files’ hold nearly as much valuable information.

Many people understand that most of the knowledge dialog in our organizations takes place in the emails and casual documents that are written and exchanged on a daily basis. These are not stored in databases, nor indexed in document management systems. They are commonly stored on shared drives, email servers and file backups, and much of this information sits on individual PCs and desktop devices. Significant knowledge is stored in this mass of unstructured material. There are “questions and answers,” photos and diagrams, short statements and multipage reports. All are produced with expertise and effort.

These repositories are becoming the new targets for mining valuable information in an organization. There are good reasons to include the unstructured information in an enterprise data management solution. These files contain interpretations, descriptions and decisions. The structured data stores contain mostly raw primary data, such as well logs and seismic traces. One could say that the unstructured data stores hold the intellectual capital and the structured data stores hold the valuable, basic factual data. Integrating these two information sources into the enterprise data management strategy therefore makes a lot of sense.

Extracting Information from Unstructured Data Stores

Increasingly capable technical solutions are enabling access to these attractive knowledge stores. Systems can access the email and documents. Systems can extract patterned information out of them, such as zip codes, addresses and telephone numbers. Search engines that operate in a ‘Google-like’ manner have been implemented and can deliver hundreds, if not thousands, of documents to the desktop.
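As a minimal sketch of what extracting patterned information can look like, the following Python example pulls US-style zip codes and telephone numbers out of free text with regular expressions. The patterns, the function name `extract_patterns` and the sample text are assumptions made for illustration; they are not taken from any system described in this paper, and production extraction would need far more robust rules.

```python
import re

# Illustrative patterns only (assumed US formats); not from the paper.
ZIP_CODE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")


def extract_patterns(text):
    """Return the zip codes and phone numbers found in a block of text."""
    return {
        "zip_codes": ZIP_CODE.findall(text),
        "phone_numbers": PHONE.findall(text),
    }


if __name__ == "__main__":
    sample = "Call the Houston office at (713) 555-0100, Houston, TX 77042."
    print(extract_patterns(sample))
```

Each extracted value can then be stored as an attribute of the document so that it becomes searchable alongside structured records.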

Just indexing the documents and files can be a daunting task and is frequently a roadblock to success in these projects. Great taxonomies are built and then languish on paper because classifying each document by hand is too time consuming and costly. Fortunately, new technologies now allow rules to be used to apply standard taxonomies to documents and electronic records automatically. However, propagating this taxonomy across the enterprise, including structured data, requires a unique architecture.
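A rule-based approach of this kind can be sketched in a few lines of Python. The taxonomy terms, the keyword lists and the function name `classify` below are hypothetical and chosen only to illustrate the idea; a real deployment would use the organization's own taxonomy and a richer rule language.

```python
# Hypothetical taxonomy rules: each term is assigned when any keyword appears.
TAXONOMY_RULES = {
    "Well Log": ["well log", "gamma ray", "resistivity"],
    "Seismic": ["seismic", "shot point"],
    "Production": ["daily production", "monthly production"],
}


def classify(text, rules=TAXONOMY_RULES):
    """Return the sorted taxonomy terms whose keywords occur in the text."""
    lowered = text.lower()
    return sorted(
        term
        for term, keywords in rules.items()
        if any(keyword in lowered for keyword in keywords)
    )


if __name__ == "__main__":
    print(classify("Gamma ray and resistivity curves tied to the seismic line"))
```

Because the rules are data rather than code, the same rule set can in principle be applied to document text and to descriptive fields of structured records.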

Figure 1 - Informal Poll Results

Poll question: “Where do you think the information with the biggest value for your organization hides?” Categories: Application Databases; DCM (Document, FileNET, etc); Corporate Databases; SharePoint and the like; “Wild Files” (S-drives, etc). Bar values as extracted, not matched to categories: 42%, 17%, 2%, 7%, 30% (vertical axis 0% to 100%).


Problems of Integrated Searches

We and others have taken steps to include structured data records in these searches and have added map displays of the location of all these information records. Now a keyword search or a map search can bring a massive number of documents, emails and database records to the desktop. We now have a different problem: we have found information, but how do we find meaning and value? How can we navigate through the massive piles of files? How can we be informed of information being developed right now? The emails never stop. The presentations do not get fewer. We rarely get rid of data; rather, we obtain more. The tsunami continues!

Capabilities to deliver to the desktop are outstripping the ability of the end users to extract value. This is becoming the most pressing of our challenges. Just getting information to the desktop does not streamline decision-making, isolate the most relevant information for your problem, nor deliver insights that are timely.

Search Results, Problems, and Information Overload

We are faced with a number of issues when working with both structured and unstructured sources. They are presented as different media. The search cycle that we use may differ at different times in our work. We will get many results from a search and are not assured that we are getting the most current or relevant ones.

Mitigation Strategies

The problems faced in the enterprise integration of structured and unstructured data require multiple approaches to resolution. Some are focused on understanding the search patterns, others on deciding what is included, others on aggregation techniques and still others on the presentation mechanisms. We will describe these individually below.

Understanding Search Patterns

We find that enterprise search is not the same for all people, nor is it the same for people who are at different stages of a project or operation.

Figure 2 - Searching Lifecycles

Query Life Cycle Stages

Project Initiation: Seeking, Finding, Filtering, Displaying/Analyzing

Within Project: Return to found set, Displaying, Deciding, Assembling

Concluding Project: Deciding, Assembling, Presenting, Archiving


During the life of a project, the functions needed by a user change from stage to stage. Each function can be optimized for each episode. For example, “Seeking” can be enriched by adding both attribute and spatial selection functions. “Returning to a Found Set” can be assisted by enabling a rapid return to the previously selected information, but also can include alerts about data that came into the system after the original search. This new data should be highlighted in some manner so that it can be recognized as ‘new’ to the end user. “Assembling” can be enhanced by the use of a ‘collection’ attribute that crosses data types and sources. This enables completeness criteria to be applied to ensure that all the necessary data is available for ‘Presentation.’ Lastly, the value of “Archiving” can be enhanced by including collection and completeness attributes along with the identity of the user and a record of the heritage of the information.

By examining the needs of each functional step and considering the operations that come before and after it, enterprise data management can produce the maximum value for the organization.
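The “Returning to a Found Set” behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Item` and `FoundSet` classes, their fields and the `revisit` function are hypothetical names, and the only rule applied is that an item loaded after the original search time is flagged as new.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Item:
    item_id: str
    title: str
    loaded_at: datetime  # when the item entered the system


@dataclass(frozen=True)
class FoundSet:
    query: str
    searched_at: datetime  # when the user ran the original search


def revisit(found_set, current_matches):
    """Pair each current match with a flag that is True when the item
    arrived after the original search, so it can be highlighted as new."""
    return [(item, item.loaded_at > found_set.searched_at) for item in current_matches]


if __name__ == "__main__":
    saved = FoundSet("lease 00063", datetime(2010, 1, 1))
    matches = [
        Item("a", "Core summary", datetime(2009, 6, 1)),
        Item("b", "New well log", datetime(2010, 3, 1)),
    ]
    for item, is_new in revisit(saved, matches):
        print(("NEW  " if is_new else "     ") + item.title)
```

Storing the search time with the found set is the design choice that makes the alert possible; the query itself can simply be re-run on return.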

Search Results Integration

Integration of search results is key to the success of enterprise data management. The high-level diagram in Figure 3 shows how structured data items, map displays and indexed unstructured data are brought together in a portal environment for the end users. The user interface was designed so that the user can navigate with free text search, taxonomy filters and map-based selection to obtain the data of interest.

Figure 3 - High-Level Integration Architecture

The user experience and screen layout are shown in Figure 4. When the cursor hovers over a data item, its location on the map blinks. This is designed to accentuate the relationships between map positions and the information items found as documents or database records in the list.
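The combination of free text search, taxonomy filtering and map-based selection described in this section can be sketched as a single filter over a mixed list of records. The `Record` fields, the bounding-box convention and the `search` function are assumptions for illustration only; they do not describe the portal's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    source: str    # "document" or "database" (assumed labels)
    taxonomy: str  # taxonomy term assigned to the record
    lon: float
    lat: float


def search(records, text=None, taxonomy=None, bbox=None):
    """Filter records by optional free text, taxonomy term and map box.

    bbox is (min_lon, min_lat, max_lon, max_lat); omitted filters are ignored.
    """
    hits = []
    for record in records:
        if text and text.lower() not in record.title.lower():
            continue
        if taxonomy and record.taxonomy != taxonomy:
            continue
        if bbox:
            min_lon, min_lat, max_lon, max_lat = bbox
            inside = min_lon <= record.lon <= max_lon and min_lat <= record.lat <= max_lat
            if not inside:
                continue
        hits.append(record)
    return hits


if __name__ == "__main__":
    catalog = [
        Record("Core summary report", "document", "Reports", -95.4, 29.8),
        Record("Well header", "database", "Well", -95.5, 29.7),
        Record("Core photo set", "document", "Reports", 2.3, 48.9),
    ]
    for hit in search(catalog, text="core", bbox=(-96.0, 29.0, -95.0, 30.0)):
        print(hit.source, hit.title)
```

Because documents and database records share the same location and taxonomy attributes, one result list can drive both the item list and the map display.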

Figure 3 diagram labels: PetrisWINDS Enterprise; Structured Data Access; ESRI GIS; Microsoft SharePoint; PetrisWINDS OneTouch; Microsoft FAST; DS; Business Intelligence; Search, Collaborate, Workflow; Data Capture, Analysis, Store & Report; Analysts, Executives and Casual Users; Work teams and Technical Users; Individual Technical Users; Specialized Apps; Best of Breed Apps; Unstructured Files.


Figure 4 - User View of Unstructured and Structured Information in Enterprise Information Portal

The thumbnails identify geo-referenced documents that are listed in the center panel. The taxonomy on the left allows filtering of the found set. The documents can be read from the links provided, and structured digital data can be viewed using plug-in viewers. Check boxes have been placed by each unstructured document so that the selected images and documents can be placed in a more convenient Image Navigation Tool, shown in Figure 5.

Image Navigation Tools

The Image Navigation Tool gives users a familiar way to “thumb” through a set of documents and images quickly and efficiently and to select the data that they would like to examine in more detail. This is another tool to deal with the hundreds or thousands of items that a search can return from an enterprise information system.

Figure 5 - Document Navigation Tool


Focused Indexing

When faced with a whole enterprise full of unstructured files, it may seem hard to know where to start. We have been fortunate to have had access to the usage patterns of an early online document indexing system, which covered the AAPG Bulletins and Special Publications. This scientific and technical information was updated monthly and aggregated over the many years of publication. In order to assess system loading, we examined the pattern of use by the technical public. We learned two important things and have brought that insight to our enterprise search approach.

We found that the Bulletins and the Special Publications had different access patterns. Interest in the monthly Bulletins was lost as time went on. The Special Publications (compendiums of papers on single topics) mostly retained their interest to the community even when the data was decades old.

What we learned was that knowledge-intensive, focused-topic documents and newer documents seemed to provide the highest value to the scientists who made the effort to download the articles. This gave us the ability to prioritize our indexing and even to weight the value of certain types of documents when they are presented to a user.
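One way to express this kind of weighting is a score that combines a per-type weight with an age decay that is slower for focused-topic publications. The sketch below is an assumption-laden illustration: the weights, the half-life values and the function name `index_weight` are invented for the example and are not the figures or formula used in our systems.

```python
# Hypothetical parameters: focused-topic publications start with a higher
# weight and lose value more slowly than periodical bulletins.
TYPE_WEIGHT = {"special_publication": 1.0, "bulletin": 0.5}
HALF_LIFE_YEARS = {"special_publication": 50.0, "bulletin": 10.0}


def index_weight(doc_type, year, current_year):
    """Return a relevance weight that halves every HALF_LIFE_YEARS of age."""
    age = max(0, current_year - year)
    decay = 0.5 ** (age / HALF_LIFE_YEARS[doc_type])
    return TYPE_WEIGHT[doc_type] * decay


if __name__ == "__main__":
    for doc_type in ("special_publication", "bulletin"):
        for year in (1960, 2000, 2010):
            print(doc_type, year, round(index_weight(doc_type, year, 2010), 4))
```

The same weight can be used twice: to decide which documents to index first and to order results when they are presented to a user.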

Relationship Mapping

A number of visualization tools are emerging that can provide unique and useful representations of the information found by an enterprise search. A website worth visiting is “Many Eyes” from IBM. We have generated a few navigation diagrams from information found in our enterprise search demonstration. The first diagram is a relationship network between various E&P data types. This diagram can be used to summarize the data types that have been found.

Charts: Online Access Patterns to SpecPubs; Online Access Patterns to Bulletins (horizontal axis: Year; vertical axis: Number downloaded)

Figure 6 - Data Type Relationship Diagram

Data types shown: Casing Segments, Casing String, Daily Production, Owners, Shot Points, Lease, Seismic Line, Field, Well, Monthly Production, Directional Survey, Logs, Reports, Attribute Values, Curves, Authors, Depth Values.


The next diagram is a Word Tree diagram. This is a visual search tool that lets you pick a word or phrase and shows you the different contexts in which it appears. The contexts are arranged in branching structures that show recurrent associations in the text. In Figure 7, we see that for lease 00063 well, data is available with core summaries and plugs. The highlight bar allows the user to zoom to a complete phrase or sentence that may be of interest.

Figure 7 - Word Tree Diagram

The last visualization is a bit stylistic, but remarkably useful. This is a Word Cloud that lets you see how frequently words appear in a given text. This can give a quick view of the contents of a report that has been found in an enterprise search.

Figure 8 - Word Cloud of a Technical Document
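The counting step behind a word cloud is simple to sketch. The example below tallies word frequencies in a text after dropping a small stop-word list; the stop words, the function name `word_frequencies` and the sample sentence are assumptions for illustration, and the rendering of the cloud itself is left to a visualization tool.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumed); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "with"}


def word_frequencies(text, top_n=10):
    """Return the top_n most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    report = "The core summary for the well and the core plugs"
    print(word_frequencies(report, top_n=3))
```

In a word cloud, each count would then set the display size of its word.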

Central Catalog/Index

Key to the propagation of a consistent taxonomy is the role of a central catalog that is synchronized with the federated data sources across the enterprise. It allows the organization to utilize the effort invested in constructing its taxonomy. It also ensures speedy query responses and easy maintenance of the taxonomy, and reduces application license costs when searching across the enterprise. Central catalog architecture is at the heart of fast Web search engines such as Google and Yahoo!, which also deal with massive amounts of information.


Socialization of Data

The socialization of data brings to enterprise data the concepts of Twitter, Facebook and alerts that are already available to commercial users. We are introducing subscriptions and comment capture in our enterprise data management solutions. The first step is subscription, and Figure 9 is an example of a subscription management user interface from a seismic data management solution.

Figure 9 - Subscription Management User Interface

A user may wish to set up a variety of subscriptions. We described a situation in one of the Query Lifecycles where a returning user would like to know whether new data is available. As can be seen here, a rich array of conditions can be specified for messages to be sent. Some are data processing “state changes”, others are availability conditions, and still others are related to comments added by other users on the information being monitored.

Notification can be implemented within the data management solution, via telecommunication portals such as SMS feeds or in emails that will arrive at the desk of the user. Figure 10 shows a consolidated summary of notification information that this user has subscribed to.


Figure 10 - Email Notification Summarizing Subscription
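The subscription and notification flow described above can be sketched as matching events against each user's subscribed conditions and grouping the results into one summary per user and channel. The class names, the event type labels and the `notifications` function are hypothetical; they illustrate the idea and are not the interface of the seismic data management solution shown in the figures.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    user: str
    dataset: str
    event_types: set        # e.g. {"state_change", "available", "comment_added"}
    channel: str = "email"  # assumed channels: "email" or "sms"


@dataclass
class Event:
    dataset: str
    event_type: str
    detail: str


def notifications(subscriptions, events):
    """Group matching event messages by (user, channel) for a single summary."""
    summary = {}
    for sub in subscriptions:
        for event in events:
            if event.dataset == sub.dataset and event.event_type in sub.event_types:
                key = (sub.user, sub.channel)
                summary.setdefault(key, []).append(f"{event.dataset}: {event.detail}")
    return summary


if __name__ == "__main__":
    subs = [Subscription("analyst", "Survey A", {"available", "comment_added"})]
    events = [
        Event("Survey A", "available", "new data loaded"),
        Event("Survey A", "state_change", "processing started"),
        Event("Survey B", "available", "new data loaded"),
    ]
    for (user, channel), messages in notifications(subs, events).items():
        print(user, channel, messages)
```

Grouping by user and channel is what produces a consolidated summary rather than one message per event.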

Summary

With the success of being able to access nearly all the information in your organization, there is a real danger of overwhelming the user with a tsunami of data. The value of this information is real and will make a significant contribution to the expert decision-making required in the E&P business.

Instead of being overwhelmed by this information, we can use wise indexing and document weighting. We can provide speedy return of results and use multiple filters, visualizations and social interaction with our information and associates. This will allow enterprise data management to achieve new levels of productivity, collaboration and sound decision-making for our organizations.

www.halliburton.com

© 2013 Halliburton. All rights reserved. Sales of Halliburton products and services will be in accord solely with the terms and conditions contained in the contract between Halliburton and the customer that is applicable to the sale. H010419 2013