Title: The Challenges of Integrating Structured and Unstructured Data


Abstract All organizations are aware that a considerable amount of technical and business information and knowledge resides both in structured databases and in unstructured repositories (e.g. documents, emails, etc.). Simply enabling independent searches of these does not produce the most value. Valuable conclusions are recorded in reports that were developed from investigating structured data, and these are lost from view when searching only databases. Together, the two sources provide the facts and the conclusions. When the searches are combined, however, they produce a plethora of information. Usability and workflows are therefore important design considerations for achieving clear, quick access to multiple disparate information sources. The power of combined access to integrated structured and unstructured data is becoming the focus of many international oil and gas companies. This paper describes the technical and usability issues of integrating disparate sources into a single, clear information environment.

Background The E&P scientific and technical business is information and data intensive. As the computerization of work has progressed, more and more of the information has migrated to electronic form. Initial data management focused on the workgroup. Multiple disciplines dealt with their own area of expertise. The initial efforts to manage data beyond the workgroups concentrated on shared structured data, such as well logs and seismic data. Interpretation results, such as horizons, picks and faults, were shared next. Recognizing that project speed and quality would increase by sharing data, organizations made an effort to define massive multidisciplinary data models. These persist today and stand alongside the active project data stores. Most of these large-scale repositories are used to store the raw incoming information that feeds the interpretation systems. At the same time, document management systems were growing. They started on the business side and migrated toward the scientific and technical side by storing reports of interpretation. Over time, emails and the documents located on shared and personal computer disks began to hold larger and larger volumes of important and valuable information. We conducted an informal web survey among a number of E&P data management staff and consultants on where they believed valuable information resides.

14th PNEC Conference 2010


Figure 1 - Informal Poll Results

Scientific and technical users believe that traditional repositories hold valuable information. But they also believe that the unstructured, ad hoc repositories of SharePoint and ‘wild files’ hold nearly as much valuable information. Many people understand that most of the knowledge dialog in our organizations takes place in the emails and casual documents that are written and exchanged on a daily basis. These are not stored in databases, nor indexed in document management systems. They are commonly stored on shared drives, email servers and file backups, and much of this information sits on individual PCs and desktop devices. Significant knowledge is stored in this mass of unstructured material. There are “questions and answers”, photos and diagrams, short statements and multipage reports. All are produced with expertise and effort. These repositories are becoming the new targets for mining valuable information in an organization. There are good reasons to include the unstructured information in an enterprise data management solution. These files contain interpretations, descriptions and decisions. The structured data stores contain mostly raw primary data, such as well logs and seismic traces. One could say that the unstructured data stores hold the intellectual capital and the structured data stores hold the valuable basic factual data. Integrating these two information sources into the enterprise data management strategy therefore makes a lot of sense.

Extracting information from unstructured data stores Increasingly capable technical solutions are enabling access to these attractive knowledge stores. Systems can access the emails and documents. Systems can extract patterned information from them, such as zip codes, addresses and telephone numbers. Search engines that operate in a ‘Google-like’ manner have been implemented and can deliver hundreds if not thousands of documents to the desktop. Just indexing the documents and files can be a daunting task and is frequently a roadblock to success in these projects. Great taxonomies are built and then languish on paper because classifying each document is too time consuming and costly. Fortunately, new technologies allow rules to be used to apply standard taxonomies to documents and electronic records. However, propagating this taxonomy across the enterprise, including structured data, requires a unique architecture.
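As a minimal sketch of the two ideas above, pattern extraction and rule-based taxonomy tagging, the following Python assumes a hypothetical two-term taxonomy and simple keyword rules. The patterns, the taxonomy terms and the function names are illustrative assumptions and are not taken from any product described in this paper.

```python
import re

# Illustrative patterns only: US-style ZIP codes and phone numbers.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

# Hypothetical taxonomy rules: term -> keywords that trigger it.
TAXONOMY_RULES = {
    "Well Data": ["well log", "core", "completion"],
    "Seismic": ["seismic", "horizon", "fault"],
}

def extract_patterns(text):
    """Return the patterned items found in free text."""
    return {"zip_codes": ZIP_RE.findall(text), "phones": PHONE_RE.findall(text)}

def classify(text):
    """Return the sorted taxonomy terms whose keywords occur in the text."""
    lowered = text.lower()
    return sorted(term for term, keywords in TAXONOMY_RULES.items()
                  if any(keyword in lowered for keyword in keywords))
```

Plain substring rules like these are crude (for instance, "core" also matches "score"); a production rule set would need word boundaries and review by the taxonomy owners.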



Problems of Integrated Searches We and others have taken steps to include structured data records in these searches and have added map displays of the location of all these information records. Now a keyword search or a map search can bring a massive number of documents, emails and database records to the desktop. We now have a different problem. We have found information; now, how do we find meaning and value? How can we navigate through the massive piles of files? How can we be informed of information being developed right now? The emails never stop. The presentations do not get fewer. We rarely get rid of data; rather, we obtain more. The tsunami continues! The capability to deliver information to the desktop is outstripping the ability of end users to extract value from it. This is becoming the most pressing of our challenges. Just getting information to the desktop does not streamline decision-making, isolate the information most relevant to the problem at hand, or deliver timely insights.

Search Results Problems and Information Overload We are faced with a number of issues when working with both structured and unstructured sources. They are presented as different media. The search cycle that we use may differ at different times in our work efforts. We will get many results from a search and have no assurance that they are the most current or relevant.

Mitigation Strategies The problems faced in the enterprise integration of structured and unstructured data require multiple approaches to resolution. Some are focused on understanding the search patterns, others on deciding what is included, others on aggregation techniques and still others on the presentation mechanisms. We will describe these individually below.

Understanding Search Patterns We find that enterprise search is not the same for all people, nor is it the same for people who are in different stages of a project or operation.



Query Life Cycle Stages

Project Initiation: Seeking, Finding, Filtering, Displaying/Analyzing

Within Project: Return to found set, Displaying, Deciding, Assembling

Concluding Project: Deciding, Assembling, Presenting, Archiving

Figure 2 - Searching Lifecycles

During the life of a project the functions needed by a user change from stage to stage. Each function can be optimized for each episode. For example, “Seeking” can be enriched by adding both attribute and spatial selection functions. “Returning to a Found Set” can be assisted by enabling a rapid return to the previously selected information, and can also include alerts about data that came into the system after the original search. This new data should be highlighted in some manner so that the end user can recognize it as ‘new’. “Assembling” can be enhanced by the use of a ‘collection’ attribute that crosses data types and sources. This enables completeness criteria to be applied to ensure that all the necessary data is available for “Presenting”. Lastly, the value of “Archiving” can be enhanced by including collection and completeness attributes along with the identity of the user and a record of the heritage of the information. By examining the needs of each functional step and considering the preceding and following operations, enterprise data management can produce the maximum value for the organization.
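Two of these functions, flagging data that arrived after the original search and checking a collection for completeness, can be sketched as follows. This is an illustration under assumed data shapes (plain tuples with load timestamps and data types); the function names and sample items are hypothetical.

```python
from datetime import datetime

def mark_new(items, last_visit):
    """Pair each (name, loaded_at) item with a flag that is True when the
    item arrived after the user's previous visit to the found set."""
    return [(name, loaded_at > last_visit) for name, loaded_at in items]

def missing_types(collection, required_types):
    """Return the required data types not yet present in a collection
    of (name, data_type) pairs."""
    present = {data_type for _, data_type in collection}
    return sorted(set(required_types) - present)

# Hypothetical found set: (item name, time loaded into the system).
found_set = [("well log A", datetime(2010, 1, 5)),
             ("core report B", datetime(2010, 3, 1))]
flags = mark_new(found_set, last_visit=datetime(2010, 2, 1))
```

Here only "core report B" would be flagged as new, because it was loaded after the assumed last visit on 1 February 2010.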

Search Results Integration Integration of search results is key to the success of enterprise data management. The high level diagram in figure 3 shows how structured data items, map displays and indexed unstructured data are brought together in a portal environment for the end users. The user interface was designed so that the user can navigate with free text search, taxonomy filters and map-based selection to obtain the data of interest.



Diagram labels: PetrisWINDS Enterprise (Structured Data Access); Microsoft SharePoint; PetrisWINDS OneTouch; Microsoft FAST; ESRI GIS; Specialized Apps; Best of Breed Apps; Unstructured files; Data Capture, Analysis, Store & Report; Search, Collaborate, Workflow; Business Intelligence; Individual Technical Users; Work teams and Technical Users; Analysts, Executives and Casual Users.

Figure 3 - High Level Integration Architecture
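The three navigation modes named above (free text search, taxonomy filters and map-based selection) can be combined over a single list of records regardless of source. The sketch below is an assumption-laden illustration, not the portal's implementation: the Record fields, the bounding-box convention and the search function are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    source: str      # e.g. "database" or "document"
    taxonomy: str    # hypothetical taxonomy term
    lon: float
    lat: float

def search(records, text=None, taxonomy=None, bbox=None):
    """Filter records by free text, taxonomy term and map bounding box.

    bbox is (min_lon, min_lat, max_lon, max_lat); any filter may be omitted.
    """
    hits = []
    for r in records:
        if text and text.lower() not in r.title.lower():
            continue
        if taxonomy and r.taxonomy != taxonomy:
            continue
        if bbox and not (bbox[0] <= r.lon <= bbox[2] and bbox[1] <= r.lat <= bbox[3]):
            continue
        hits.append(r)
    return hits
```

Because structured and unstructured items share one record shape, a single result list can drive both the document panel and the map display.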

The user experience and screen layout are shown in figure 4. When the cursor hovers over a data item, its location on the map blinks. This is designed to accentuate the relationship between map positions and the information items found as documents or database records in the list.

Figure 4 - User View of Unstructured and Structured Information in Enterprise Information Portal

The thumbnails identify geo-referenced documents that are listed in the center panel. The left taxonomy allows filtering of the found set. The documents can be read from the links provided and structured digital data can be viewed using plug-in viewers. The check boxes have been placed by each unstructured document so that the selected files of images and documents can be placed in a more convenient Image Navigation Tool, shown in figure 5.

Image Navigation Tools The Image Navigation Tool gives users a familiar paradigm: they can “thumb” through a set of documents and images quickly and efficiently to select the data that they would like to examine in more detail. This is another tool for dealing with the hundreds or thousands of items that a search can return from an enterprise information system.



Figure 5 - Document Navigation Tool

Focused Indexing When faced with a whole enterprise full of unstructured files, it may seem hard to know where to start. We have been fortunate to have had access to the usage patterns of an early online document indexing system. Its content was the AAPG Bulletins and Special Publications. This scientific and technical information was updated monthly and aggregated over the many years of publication. In order to assess system loading, we asked questions about the pattern of use by the technical public. We learned two important things and have brought that insight to our enterprise search approach. We found that the Bulletins and the Special Publications had different access patterns. Interest in the monthly Bulletins was lost as time went on. The Special Publications, compendiums of papers on single topics, mostly retained the interest of the community even when decades old.

Chart: Online Access Patterns to Bulletins (horizontal axis: Year, 1900 to 2020; vertical axis: Number Downloaded, 0 to 2000)

Chart: Online Access Patterns to SpecPubs (horizontal axis: Year, 1910 to 2010; vertical axis: Number Downloaded, 0 to 400)

What we learned was that knowledge-intensive, single-topic documents and newer documents seemed to provide the highest value to the scientists who made the effort to download the articles. This allowed us to prioritize our indexing and even to weight the value of certain types of documents when they are presented to a user.
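One way to express such weighting is an age decay whose rate depends on document type, so that bulletin-style material fades faster than single-topic compendiums. The sketch below is only an illustration: the exponential form and the half-life values are assumptions chosen for the example, not figures derived from the usage data.

```python
# Assumed half-lives in years: how quickly interest in each type fades.
# These numbers are placeholders, not values measured from the usage data.
HALF_LIFE_YEARS = {"bulletin": 10.0, "special_publication": 40.0}

def document_weight(doc_type, published_year, current_year):
    """Weight in (0, 1] that halves with every half-life of document age."""
    age = max(0, current_year - published_year)
    return 0.5 ** (age / HALF_LIFE_YEARS[doc_type])
```

Under these assumed values, a ten-year-old bulletin would receive half the weight of a new one, while a special publication of the same age would keep most of its weight. The same score could order an indexing backlog or rank results shown to a user.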



Relationship Mapping A number of visualization tools are emerging that can provide some unique and even useful representations of the information found from an enterprise search. A web site worth visiting is “Many Eyes” from IBM. I have generated a few navigation diagrams from information found in our enterprise search demonstration. The first diagram is a relationship network between various E&P data types. This diagram can be used to summarize the content of data types that have been found.

Figure 6 - Data Type Relationship Diagram

The next diagram is a Word Tree diagram. This is a visual search tool that lets you pick a word or phrase and shows you the different contexts in which it appears. The contexts are arranged in branching structures that show recurrent associations in the text. In figure 7, we see that for lease 00063 well data is available with core summaries and plugs. The highlight bar allows the user to zoom to a complete phrase or sentence that may be of interest.

Figure 7 - Word Tree Diagram
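The branching contexts behind a word tree can be approximated by counting the word sequences that follow a chosen phrase. The sketch below is a simplified illustration and is not how Many Eyes builds its diagrams; the function name and the `width` parameter are hypothetical, and sentence boundaries are ignored.

```python
import re
from collections import Counter

def word_tree_branches(text, phrase, width=3):
    """Count the word sequences (up to `width` words) that follow `phrase`."""
    words = re.findall(r"[\w']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    branches = Counter()
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            following = words[i + n:i + n + width]
            if following:
                branches[" ".join(following)] += 1
    return branches
```

Branches with the highest counts correspond to the thickest limbs of the tree, which is what makes recurrent associations visible.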



The last visualization is more stylistic, but remarkably useful. This is a Word Cloud that lets you see how frequently each word appears in a given text. This can give a quick view of the contents of a report that has been found in an enterprise search.

Figure 8 - Word Cloud of a Technical Document
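The data behind a word cloud is simply a word frequency table, which a renderer then maps to font sizes. A minimal sketch follows, assuming a small illustrative stop-word list; the list and the function name are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}

def word_frequencies(text, top=10):
    """Return the `top` most frequent non-stop-words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(top)
```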

Central Catalog/Index Key to the propagation of a consistent taxonomy is a central catalog that is synchronized with the federated data sources across the enterprise. It allows the organization to benefit from the effort invested in constructing its taxonomy. It also ensures speedy query responses, eases maintenance of the taxonomy and reduces application license costs when searching across the enterprise. A central catalog architecture is at the heart of fast web search engines such as Google and Yahoo!, which also deal with massive amounts of information.
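The idea can be illustrated with a toy catalog that indexes taxonomy terms from several sources and answers queries without contacting those sources. This is a sketch under simplifying assumptions (in-memory storage, full re-synchronization per source); the class, method and source names are hypothetical.

```python
from collections import defaultdict

class CentralCatalog:
    """Toy central index over several federated sources."""

    def __init__(self):
        # taxonomy term -> set of (source, item_id) pairs
        self._by_term = defaultdict(set)

    def sync(self, source, items):
        """Replace everything previously indexed for `source`.

        `items` maps an item id to the list of taxonomy terms applied to it.
        """
        for entries in self._by_term.values():
            entries.difference_update({e for e in entries if e[0] == source})
        for item_id, terms in items.items():
            for term in terms:
                self._by_term[term].add((source, item_id))

    def lookup(self, term):
        """Answer from the catalog alone, without querying the sources."""
        return sorted(self._by_term.get(term, set()))
```

Because lookups touch only the catalog, query speed does not depend on the responsiveness of each underlying repository; the cost is that results are only as fresh as the last synchronization.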

Socialization of Data The socialization of data brings the concepts of Twitter, Facebook and alerts, which are already available to commercial users, into enterprise data management. We are introducing subscriptions and comment capture in our enterprise data management solutions. The first step is the subscription, and Figure 9 is an example of a subscription management user interface from a seismic data management solution.



Figure 9 - Subscription Management User Interface

There are a variety of subscriptions that a user may wish to set up. We described a situation in one of the Query Lifecycles where a returning user would like to know whether new data is available. As can be seen here, a rich array of conditions can be specified for messages to be sent. Some are data processing “state changes”, others are availability conditions and others are related to comments added by other users on the information being monitored. Notification can be implemented within the data management solution, via telecommunication channels such as SMS feeds, or in emails that arrive at the desk of the user. Figure 10 shows a consolidated summary of the notification information that this user has subscribed to.



Figure 10 - Email Notification Summarizing Subscription
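The matching of events to subscriptions, and the consolidation of messages per user, can be sketched as follows. The data shapes, the event type names ("state_change", "new_data", "comment") and the function name are assumptions made for illustration; they do not describe the user interface shown in Figure 9.

```python
from dataclasses import dataclass

@dataclass
class Subscription:
    user: str
    dataset: str
    event_types: frozenset   # e.g. {"state_change", "new_data", "comment"}

def notifications(subscriptions, events):
    """Group matching event messages by subscribed user.

    Each event is a (dataset, event_type, message) tuple.
    """
    summary = {}
    for dataset, event_type, message in events:
        for sub in subscriptions:
            if sub.dataset == dataset and event_type in sub.event_types:
                summary.setdefault(sub.user, []).append(message)
    return summary
```

The resulting per-user lists could feed an in-application inbox, an SMS feed or a summary email, matching the delivery options described above.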

Summary With the success of being able to access nearly all the information in your organization comes a real danger of overwhelming the user with a tsunami of data. The value of this information is real, and it will make a significant contribution to the expert decision-making required in the E&P business. Instead of being overwhelmed by this information, we can use wise indexing and document weighting. We can provide speedy return of results and use multiple filters, visualizations and social interaction with our information and associates. This will allow enterprise data management to achieve new levels of productivity, collaboration and sound decision-making for our organizations.

