Data Warehousing Fundamentals - Sigarra

50102GC20

Production 2.0

May 1999

M08762

Data Warehousing Fundamentals

Volume 2 • Student Guide

Authors

Chon S. Chua

Richard Green

Technical Contributors and Reviewers

Jackie Collins

Jennifer Jacoby

Mike Schmitz

John Haydu

Russ Pitts

Lauran Serhal

Brian Pottle

Donna Corrigan

Patricia Moll

Harry Penbert

SuiWah Chan

Joel Barkin

Steve Dressler

Publisher

Tony McGettigan

Copyright Oracle Corporation, 1999. All rights reserved.

This documentation contains proprietary information of Oracle Corporation. It is provided under a license agreement containing restrictions on use and disclosure and is also protected by copyright law. Reverse engineering of the software is prohibited. If this documentation is delivered to a U.S. Government Agency of the Department of Defense, then it is delivered with Restricted Rights and the following legend is applicable:

Restricted Rights Legend

Use, duplication or disclosure by the Government is subject to restrictions for commercial computer software and shall be deemed to be Restricted Rights software under Federal law, as set forth in subparagraph (c) (1) (ii) of DFARS 252.227-7013, Rights in Technical Data and Computer Software (October 1988).

This material or any portion of it may not be copied in any form or by any means without the express prior written permission of Oracle Corporation. Any other copying is a violation of copyright law and may result in civil and/or criminal penalties.

If this documentation is delivered to a U.S. Government Agency not within the Department of Defense, then it is delivered with “Restricted Rights,” as defined in FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

The information in this document is subject to change without notice. If you find any problems in the documentation, please report them in writing to Education Products, Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.

Data Warehouse Method—A Methodology for Designing Data Warehouse, SQL*Loader, PL/SQL, Pro*C, Oracle7, Oracle8, and Oracle8i, Distributed Option, Parallel Query Option, Parallel Server Option, Media Server, Spatial Data Option, ConText Option, Video Server, Text Server, WebServer, Oracle Universal Server ROLAP Option, Express Server, Web-enabled Express Server, SQL*Net, Developer/2000, Relational Access Manager, Discoverer, Designer/2000, SQL*Bridge, Transparent Gateway Developer’s Kit, Procedural Gateway Developer’s Kit, Express, Express Analyzer, Express Objects, Sales Analyzer, and Financial Analyzer are product names, trademarks, or registered trademarks of Oracle Corporation.

All other products or company names are used for identification purposes only and may be trademarks of their respective owners.

Contents

Preface

Profile

Related Publications

Typographic Conventions

Lesson 1: Introduction

Course Objectives

Agenda

Questions About You

Lesson 2: Meeting a Business Need

Overview

Unsuitability of OLTP Systems for Complex Analysis

Management Information Systems and Decision Support

Data Extract Processing

Business Drivers for Data Warehouses

Current Situation and Growth of Data Warehousing

Typical Uses of a Data Warehouse

Summary

Practice 2-1

Lesson 3: Defining Data Warehouse Concepts and Terminology

Overview

Data Warehouse Definition

Data Warehouse Properties

Data Warehouse Terminology

Components of a Data Warehouse

Oracle Warehouse Vision, Products, and Services

Summary

Practice 3-1

Lesson 4: Driving Implementation Through a Methodology

Overview

Warehouse Development Approaches

The Need for an Iterative and Incremental Methodology

Oracle Data Warehouse Method

DWM Fundamental Elements

Oracle Warehouse Technology Initiative (WTI)

Summary

Practice 4-1

Lesson 5: Planning for a Successful Warehouse

Overview

Managing Financial Issues

Obtaining Business Commitment

Managing a Warehouse Project

Identifying Planning Phases

Identifying Warehouse Strategy Phase Deliverables

Identifying Project Scope Phase Deliverables

Summary

Practice 5-1

Lesson 6: Analyzing User Query Needs

Overview

Types of Users

Gathering User Requirements

Managing User Data Access

Security

OLAP

Query Access Architectures

Summary

Practice 6-1

Lesson 7: Modeling the Data Warehouse

Overview

Data Warehouse Database Design Phases

Phase One: Defining the Business Model

Phase Two: Creating the Dimensional Model

Data Modeling Tools

Summary

Practice 7-1

Lesson 8: Choosing a Computing Architecture

Overview

Architecture Requirements

The Hardware Architecture

Database Server Requirements

Parallel Processing

Summary

Practice 8-1

Lesson 9: Planning Warehouse Storage

Overview

The Server Data Architecture

Protecting the Database

Summary

Practice 9-1

Lesson 10: Building the Warehouse

Overview

Extracting, Transforming, and Transporting Data

Extracting Data

Examining Data Sources

Extraction Techniques

Extraction Tools

Summary

Practice 10-1

Lesson 11: Transforming Data

Overview

Importance of Data Quality

Transformation

Transforming Data: Problems and Solutions

Transformation Techniques

Transformation Tools

Summary

Practice 11-1

Lesson 12: Transportation: Loading Warehouse Data

Overview

Transporting Data into the Warehouse

Building the Transportation Process

Transporting the Data

Postprocessing of Loaded Data

Summary

Practice 12-1

Lesson 13: Transportation: Refreshing Warehouse Data

Overview

Capturing Changed Data

Limitations of Methods for Applying Changes

Purging and Archiving Data

Final Tasks

Selecting ETT Tools

Summary

Practice 13-1

Lesson 14: Leaving a Metadata Trail

Overview

Defining Warehouse Metadata

Developing a Metadata Strategy

Examining Types of Metadata

Metadata Management Tools

Common Warehouse Metadata

Summary

Practice 14-1

Lesson 15: Supporting End-User Access

Overview

Business Intelligence

Multidimensional Query Techniques

Categories of Business Intelligence Tools

Data Mining in a Warehouse Environment

Oracle Data Mining Partners

Summary

Practice 15-1

Lesson 16: Web-Enabling the Warehouse

Overview

Accessing the Warehouse Over the Web

Common Web Data Warehouse Architecture

Issues in Deploying a Data Warehouse on the Web

Evaluating Web-Based Tools

Summary

Practice 16-1

Lesson 17: Managing the Data Warehouse

Overview

Managing the Transition to Production

Managing Growth

Managing Backup and Recovery

Identifying Data Warehouse Performance Issues

Summary

Appendix A: Practice Solutions

Practice 2-1

Practice 3-1

Practice 4-1

Practice 5-1

Practice 6-1

Practice 7-1

Practice 8-1

Practice 9-1

Practice 10-1

Practice 11-1

Practice 12-1

Practice 13-1

Practice 14-1

Practice 15-1

Practice 16-1

Glossary

Lesson 10: Building the Warehouse

Slide: Overview

Course roadmap blocks: Project Management (Methodology, Maintaining Metadata); Meeting a Business Need; Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Modeling the Data Warehouse; Choosing a Computing Architecture; Planning Warehouse Storage; ETT (Building the Warehouse); Supporting End User Access; Managing the Data Warehouse

Overview

In this lesson, you explore the sources of data for the data warehouse. You consider how the extraction and transformation processes take data from source systems and change it into data that is acceptable to the users of the data warehouse. The lesson also describes typical data anomalies and looks at ways to eliminate them.

Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide on the facing page.

Objectives

After completing this lesson, you should be able to do the following:

• Outline the extraction, transformation, and transportation processes for building a data warehouse.

• Identify extraction issues.

• Explain how to examine data sources.

• Identify extraction techniques.

• List tools that can be used to extract data from sources.

Slide: Extraction/Transformation/Transportation Processes (ETT)

• Extract source data

• Transform/clean data

• Index and summarize

• Load data into WH

• Detect changes

• Refresh data

Diagram labels: Operational systems; ETT (Programs, Tools, Gateways); Warehouse

Extracting, Transforming, and Transporting Data

Extraction, Transformation, and Transportation Tasks

Before considering this lesson’s focus on extraction, you should be aware that extraction, transformation, and transportation (sometimes called ETT) describes the series of processes that:

• Extract data from source systems

• Transform and clean up the data

• Index the data

• Summarize the data

• Load data into the warehouse

• Detect the changes made to source data required for the warehouse

• Restructure keys

• Maintain the metadata

• Refresh the warehouse with updated data

You can use custom programming, gateways between database systems, and internally developed tools or vendor tools to carry out the ETT processes.

Slide: ETT Processes

• Must result in data that is relevant, useful, high-quality, accurate, and accessible

• Require a large proportion of warehouse development time and resources

Diagram labels: Operational systems; ETT (Clean up, Consolidate, Restructure); Warehouse (Relevant, Useful, Quality, Accurate, Accessible)

ETT Processes

ETT Importance

The extraction, transformation, and transportation processes are fundamental to ensuring that the data resident in the warehouse is:

• Relevant and useful to the business users

• High quality

• Accurate

• Easy to access so that the warehouse is used efficiently and effectively by the business users

ETT Cost

Building the ETT process is potentially one of the biggest tasks of building a warehouse; it is complex and time-consuming. In some implementations, it can take more than half of the total warehouse implementation effort.

Note: Extraction is covered by this lesson; transformation and transportation are considered in the next two lessons.

Slide: Data Staging Area

• The construction site for the warehouse

• Required by most implementations

• Composed of ODS, flat files, or relational server tables

• Frequently configured as multitier staging

Diagram labels: Operational system; Extract; Data staging area; Transform; Transport (Load); Warehouse

The Data Staging Area

Ralph Kimball is one of the most widely recognized experts in the field of data warehousing. Kimball calls the data staging area the construction site for the warehouse. This is where much of the data transformation and cleansing takes place.

A staging area is a typical requirement of warehouse implementations. It may be an operational data store environment, a set of flat files, a series of tables in a relational database server, or proprietary data structures used by data staging tools.

You may employ multitier staging that reconciles data before and after the transformation process and before data is loaded into the warehouse. As many as three tiers are possible, from the operational server to the staging area and then to the warehouse server.

Note: Some ETT tools stage data internally and do not require a separate staging area.

If you are using the Oracle server and in-house developed tools, data is typically transformed after it is bulk-loaded (using SQL*Loader) into the staging area—the database tables. PL/SQL is often used to transform the data. You may also use gateways and replication techniques.
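The load-then-transform pattern described above can be sketched as follows. This is a minimal illustration, not the course's implementation: Python's built-in sqlite3 module stands in for the Oracle server, a plain bulk insert stands in for SQL*Loader, and a set-based SQL statement stands in for a PL/SQL transformation. The table names (stg_customer, wh_customer), the sample rows, and the cleansing rules are all assumptions made for the example.

```python
import sqlite3

# Hypothetical raw extract rows, as they might arrive from an operational system.
RAW_ROWS = [
    ("123", " bloggs ", "10/12/56"),
    ("124", "SMITH", "01/02/60"),
]


def load_staging(conn, rows):
    """Bulk-load raw rows into the staging table without changing them."""
    conn.execute("CREATE TABLE stg_customer (cust_no TEXT, name TEXT, dob TEXT)")
    conn.executemany("INSERT INTO stg_customer VALUES (?, ?, ?)", rows)


def transform_to_warehouse(conn):
    """Clean the staged rows with SQL and write them to the warehouse table."""
    conn.execute("CREATE TABLE wh_customer (cust_no INTEGER, name TEXT, dob TEXT)")
    conn.execute(
        """
        INSERT INTO wh_customer (cust_no, name, dob)
        SELECT CAST(cust_no AS INTEGER), UPPER(TRIM(name)), dob
        FROM stg_customer
        """
    )


def run_staging_demo():
    """Load, transform, and return the warehouse rows ordered by customer number."""
    conn = sqlite3.connect(":memory:")
    load_staging(conn, RAW_ROWS)
    transform_to_warehouse(conn)
    rows = conn.execute(
        "SELECT cust_no, name, dob FROM wh_customer ORDER BY cust_no"
    ).fetchall()
    conn.close()
    return rows
```

Keeping the raw rows untouched in the staging table means the transformation can be rerun or corrected without going back to the operational system.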

Slide: Remote Staging Model

• Data staging area within the warehouse environment

• Data staging area in its own environment, avoiding negative impact on the warehouse environment

Diagram labels: Operational system (operational environment); Extract, transform, transport; Data staging area; Transform; Transport (Load); Warehouse

Slide: Onsite Staging Model

• Data staging area within the operational environment, possibly affecting the operational system

Diagram labels: Operational system; Extract; Data staging area; Transform; Transport (Load); Warehouse (warehouse environment)

Possible Staging Models

Choosing a Model

The model you choose depends upon operational and warehouse requirements, system availability, connectivity bandwidth, gateway access, and volume of data to be moved or transformed.

Remote Staging Model

You may choose to extract the data from the operational environment and transport it into the warehouse environment for transformation processing. You may optionally execute some transformation processing during the extraction and transportation from operational to warehouse environment. You would then execute the bulk of transformation processing in the warehouse environment’s staging area.

On-site Staging Model

Alternatively, you may choose to perform the cleansing, transformation, and summarization processes locally in the operational environment and then extract to the staging area. This model may conflict with the day-to-day working of the operational system. If chosen, this model’s process should be executed when the operational system is idle or less heavily used.

Slide: Extracting Data

• Routines developed to select fields from source

• Various data formats

• Rules, audit trails, error correction facilities

Diagram labels: Operational databases; Data mapping; Data staging area; Transform; Warehouse database

Slide: Source Systems

• Production

• Archive

• Internal

• External

Extracting Data

The process of data extraction takes selected data fields that pertain to the subject area maintained by the data warehouse. The data may come from a variety of source systems, and the data may exist in a variety of formats.

The extraction routines are developed to account for the variety of systems from which data is taken. These routines contain data or business rules, as well as audit trails and error correction facilities.

Source Systems

The source systems mentioned may be in the form of data existing in:

• Production operational systems

• Archives

• Internal files not directly associated with company operational systems, such as individual spreadsheets and workbooks

• External data from outside the company

Extraction Routines

The routines created for extraction are specifically developed to account for the variety of systems from which data is taken. The routines contain data or business rules, audit trails, and error correction facilities. The routines take into account the frequency with which data is to be extracted.
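As a rough sketch of what such a routine might look like, the following Python example selects fields from a source extract, applies simple rules, sets rejected rows aside for correction, and records an audit entry. The CSV layout, the field names, and the specific rules (numeric customer number, non-empty name) are assumptions chosen for illustration only; real routines depend on the source system.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical source extract; a real routine would read a file or query a source system.
SOURCE_CSV = """cust_no,name,balance
123,Bloggs,100.50
124,,20.00
abc,Smith,30.00
"""


def extract(source_text):
    """Select the wanted fields, apply simple rules, and record an audit trail."""
    accepted, rejected = [], []
    reader = csv.DictReader(io.StringIO(source_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            record = {
                "cust_no": int(row["cust_no"]),
                "name": (row["name"] or "").strip(),
                "balance": float(row["balance"]),
            }
            if not record["name"]:
                raise ValueError("name is empty")
            accepted.append(record)
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the rejected row and the reason so it can be corrected later.
            rejected.append({"line": line_no, "reason": str(exc), "row": dict(row)})
    audit = {
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "rows_read": len(accepted) + len(rejected),
        "rows_accepted": len(accepted),
        "rows_rejected": len(rejected),
    }
    return accepted, rejected, audit
```

Returning the rejected rows alongside the accepted ones, rather than silently dropping them, is one simple way to support the error correction the text calls for.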

Slide: Production Data

• Operating system platforms

• Hardware platforms

• File systems

• Database systems and vertical applications

Examples shown: IMS, DB2, VSAM, NonStop SQL, Oracle, Sybase, Rdb, SAP, Shared Medical Systems, Dun and Bradstreet Financials, Hogan Financials, Oracle Financials

Slide: Archive Data

• Historical data

• Useful for analysis over long periods of time

• Useful for first-time load

• May require unique transformations

Diagram labels: Operational databases; Warehouse database

Examining Data Sources

Production Data

Production data may come from a multitude of different sources:

• Operating system platforms

• Hardware platforms

• File systems (flat files)

• Database systems, for example, Oracle, DB2, dBase, Informix, ISAM, NonStop SQL, Rdb, and TurboImage

• Vertical applications, such as Oracle Financials, SAP, PeopleSoft, Baan, and Dun and Bradstreet

Archive Data

Archive data may be useful to the enterprise in supplying historical data. Historical data is needed if analysis over long periods of time is to be achieved.

Archive data is not used consistently as a source for the warehouse; for example, it would not be used for regular data refreshes. However, for the initial implementation of a data warehouse (and the first-time load), archived data is an important source of historical data.

You need to consider this carefully when planning the data warehouse. How much historical data do you have available for the data warehouse? How much effort is necessary to transform it into an acceptable format?

Archive data may need careful and unique transformations, and clear details of the changes must be maintained in metadata.

Slide: Internal Data

• Planning, sales, and marketing organization data

• Maintained by:

– Spreadsheets (structured)

– Documents (unstructured)

• Treated like any other source data

Diagram labels: Planning, Marketing, and Accounting spreadsheets; Warehouse database

Internal Data

Internal data may be information prepared by planning, sales, or marketing organizations that contains data such as budgets, forecasts, or sales quotas. The data contains figures (numbers) that are used across the enterprise for comparison purposes. The data is maintained using software packages such as spreadsheets and word processors and uploaded into the warehouse.

Internal data is treated like any other source system data. It must be transformed, documented in metadata, and mapped between the source and target databases.

Slide: External Data

• Information from outside the organization

• Issues of frequency, format, and predictability

• Described and tracked using metadata

Examples shown: Barron’s; Dun and Bradstreet; Purchased databases; Wall Street Journal; Economic forecasts; Competitive information; A.C. Nielsen, IRI, IMS, Walsh America; Warehousing databases

External Data

External data is important if you want to compare the performance of your business against others. There are many sources for external data:

• Periodicals and reports

• External syndicated data feeds (Some warehouses rely regularly on this as a source)

• Competitive analysis information

• Newspapers

• Purchased marketing, competitive, and customer related data

• Free data from the Web

Issues

You must consider the following issues with external data:

• Frequency: There is no real pattern like that of internal data. Constant monitoring is required to determine when it is available.

• Format: The data may differ in format from internal data, and the granularity of the data may be an issue. To make it useful to the warehouse, a certain amount of reformatting may be required. In addition, you may find that external data, particularly that available on the Web, comes with digital audio data, picture image data, and digital video data. These present an interesting challenge in storage and speed of access.

• Predictability: External data is not predictable; it can come from any source at any time, in any format, on any medium.

Tracked Using Metadata

Metadata (described earlier as descriptive data about data) plays an invaluable role in the registration, access, and control of external data. The metadata should provide the warehouse manager with as much information about the external data as possible, averting the need to examine the data closely.
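One way to picture the registration of external data is a small metadata registry, sketched below in Python. The class names, the fields recorded (supplier, format, granularity, arrival date, medium), and the lookup methods are all hypothetical; the source does not prescribe a metadata structure, so treat this purely as an illustration of recording descriptive data so that the feed itself need not be opened.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ExternalSourceMetadata:
    """Descriptive data about one external feed (all fields are illustrative)."""
    source_name: str
    supplier: str
    data_format: str
    granularity: str
    received_on: str  # ISO date on which the feed arrived
    medium: str


class ExternalDataRegistry:
    """Registers external feeds and answers questions from metadata alone."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # A later registration under the same name replaces the earlier one.
        self._entries[entry.source_name] = entry

    def describe(self, source_name):
        return asdict(self._entries[source_name])

    def sources_in_format(self, data_format):
        return sorted(
            name
            for name, entry in self._entries.items()
            if entry.data_format == data_format
        )
```

With such a registry, a question like "which feeds arrive as flat files?" can be answered from the metadata without examining the feeds.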

Note: ETT decisions and strategies can evolve over time throughout the life of the warehouse. It may be prudent to track those strategies and decisions, so that you can always explain the algorithmic logic or business rules used at different times with current, recent, or archived data.

Slide: Mapping

• Defines which operational attributes to use

• Defines how to transform the attributes for the warehouse

• Defines where the attributes exist in the warehouse

• Mapping tools are available

Example shown on the slide:

File A field    File A value    Staging File One field    Staging File One value
F1              123             Number                    USA123
F2              Bloggs          Name                      Mr. Bloggs
F3              10/12/56        DOB                       10-Dec-56

The metadata records the mapping from File A to Staging File One: F1 to Number, F2 to Name, and F3 to DOB.

Mapping Data

Once you have determined your business subjects for the warehouse, you need to determine the required attributes from the source systems.

On an attribute-by-attribute basis you must determine how the source data maps into the data warehouse, and what, if any, transformation rules to apply. This is known as mapping. There are mapping tools available.

Mapping information should be maintained in metadata that is server (RDBMS) resident, for ease of access, maintenance, and clarity.
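The attribute-by-attribute mapping described above can be sketched as a small metadata-driven routine. This is only an illustration based on the File A and Staging File One values from the Mapping slide; the rule functions (`add_country_prefix`, `add_title`, `reformat_date`) and the dictionary layout are assumptions, not part of any mapping tool.

```python
# Month abbreviations used to turn DD/MM/YY into DD-Mon-YY.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def add_country_prefix(value):
    """Assumed rule: prefix the source number with a country code."""
    return "USA" + value


def add_title(value):
    """Assumed rule: prefix the surname with a title."""
    return "Mr. " + value


def reformat_date(value):
    """Assumed rule: convert DD/MM/YY to DD-Mon-YY."""
    day, month, year = value.split("/")
    return f"{day}-{MONTHS[int(month) - 1]}-{year}"


# Hypothetical mapping metadata:
# source field -> (target attribute, transformation rule).
FILE_A_TO_STAGING_ONE = {
    "F1": ("Number", add_country_prefix),
    "F2": ("Name", add_title),
    "F3": ("DOB", reformat_date),
}


def apply_mapping(source_record, mapping):
    """Build a staging record by applying each mapping rule in turn."""
    staged = {}
    for source_field, (target_attribute, rule) in mapping.items():
        staged[target_attribute] = rule(source_record[source_field])
    return staged


file_a_record = {"F1": "123", "F2": "Bloggs", "F3": "10/12/56"}
staged_record = apply_mapping(file_a_record, FILE_A_TO_STAGING_ONE)
print(staged_record)
```

Keeping the rules in one structure, rather than scattered through extraction programs, is what makes it practical to store the mapping as server-resident metadata.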


Extraction Techniques

• Programs: C, COBOL, PL/SQL

• Gateways: transparent database access

• In-house development is popular

• Tools

– High initial cost

– Ongoing automation

– Data cleanup


Extraction Techniques

You can extract data from different source systems to the warehouse in different ways:

• Programmatically, using procedural languages such as COBOL, C, C++, or Procedural SQL

• Using a gateway to access data sources. This method is acceptable only for small amounts of data; otherwise, the network traffic becomes unacceptably high.

• In-house developed tools that:

– Store a physical definition of the source and warehouse data

– Create data dictionaries

– Generate data conversion programs

– Clean and transform the data

– Allow selective retrieval

– Maintain metadata

Note: In-house development is an ongoing activity that may become a resource black hole. You need local knowledge to support all of the file formats.

• Using a vendor’s data extraction tool

Although it is expensive, an extraction tool:

– Provides ongoing automation of the data extraction process

– Supports data cleanup

More than 50% of companies use their own in-house development teams to develop data extraction programs. The extraction process may access different host system media, such as fiche, optical, tape, CD, and disk formats.


Sources and Targets

[Figure: Sources, ODS, Warehouse, and Access stages; access includes OLAP, data marts, data analysis, and data mining]


Sources and Targets

To summarize, the data for the warehouse is a complex mixture of structured and unstructured data from different source systems. It all needs to be moved in a clean and integrated state into the warehouse.

Note: The same process is performed for current data that is to reside in an operational data store.


Designing Extraction Processes

• Analysis:

– Sources, technologies

– Data types, quality, owners

• Design options:

– Manual, custom, gateway, third-party

– Replication, full, or delta refresh

• Design issues:

– Batch window, volumes, data currency

– Automation, skills needed, resources


Designing Extraction Processes

When designing your extraction processes, consider the analysis issues, the design options available to you, and the design issues.

Analysis

• The sources and technologies used

• Existing data feeds and redo logs

• Data types (EBCDIC or ASCII)

• Data quality and ownership

• Data volumes

• Operational schedule in the source environment

• Spare processing capacity in the source environment

Design Options

• Manual data entry

• Custom programs

• Gateway technologies

• Replication techniques

• Third-party tools

• Full refresh or delta changes

Design Issues

• Batch window

• Data volumes

• Data currency (how up-to-date the data is to be)

• Degree of automation required

• Technology skills needed

• Time and money available
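The difference between the "full refresh" and "delta changes" design options can be illustrated with a minimal sketch. It assumes that each source row carries a `last_modified` timestamp and that the time of the previous extraction run is recorded somewhere; both are hypothetical, and many source systems offer no such column, in which case logs or replication would be needed instead.

```python
from datetime import datetime


def full_refresh(source_rows):
    """Return every source row; the target is rebuilt from scratch."""
    return list(source_rows)


def delta_extract(source_rows, last_extract_time):
    """Return only the rows changed after the previous extraction run."""
    return [row for row in source_rows
            if row["last_modified"] > last_extract_time]


# Hypothetical source rows with a change timestamp.
source_rows = [
    {"cusnum": 1, "name": "Summit Sports", "last_modified": datetime(1999, 5, 1, 9, 0)},
    {"cusnum": 2, "name": "Speedy Pizza", "last_modified": datetime(1999, 5, 3, 14, 30)},
    {"cusnum": 3, "name": "Hollywood", "last_modified": datetime(1999, 5, 4, 8, 15)},
]
last_extract_time = datetime(1999, 5, 2, 0, 0)

full = full_refresh(source_rows)
delta = delta_extract(source_rows, last_extract_time)
print(len(full), "rows in a full refresh;", len(delta), "rows in the delta")
```

A delta extract moves less data and so eases the batch window, at the price of needing a dependable way to detect changes in the source.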


Maintaining Extraction Metadata

• Source location, type, structure

• Access method

• Privilege information

• Temporary storage

• Failure procedures

• Validity checks

• Handlers for missing data


Maintaining Extraction Metadata

It is essential to maintain a “metadata trail” of information about all ETT processes, including the extraction process. This information is important for warehouse enhancement and performance improvements.

The quality of metadata is critical for every aspect of the warehouse; attention must be paid to its control, management, and change.

Extraction metadata includes:

• The source location, type, contact, and structure information

• The access method

• The privilege information

• The extraction temporary storage information

• The extraction failure procedures and validity checks

• Information about how to handle missing data

Extraction metadata also contains information about the frequency of program execution and maps the source data to the target database.
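One way to picture the extraction metadata listed above is as a single record per source. The sketch below is an assumption about how such a record might be laid out; the field names and the sample values (paths, contact, schedule) are hypothetical and do not come from any particular repository product.

```python
from dataclasses import dataclass, fields


@dataclass
class ExtractionMetadata:
    """One metadata record per extracted source (illustrative layout)."""
    source_location: str
    source_type: str
    source_contact: str
    source_structure: dict    # source field name -> data type
    access_method: str
    privileges: str
    temporary_storage: str
    failure_procedure: str
    validity_checks: list
    missing_data_rule: str
    frequency: str
    source_to_target: dict    # source field name -> target attribute


def incomplete_items(metadata):
    """Return the names of metadata items that were left empty."""
    return [f.name for f in fields(metadata) if not getattr(metadata, f.name)]


# Hypothetical entry; privileges is deliberately left blank.
customer_extract = ExtractionMetadata(
    source_location="/data/orders/file_a.dat",
    source_type="flat file",
    source_contact="operations desk",
    source_structure={"F1": "number", "F2": "text", "F3": "date"},
    access_method="custom program",
    privileges="",
    temporary_storage="/staging/tmp",
    failure_procedure="restart from the beginning",
    validity_checks=["F1 is numeric", "F3 is a valid date"],
    missing_data_rule="reject the record",
    frequency="nightly",
    source_to_target={"F1": "Number", "F2": "Name", "F3": "DOB"},
)

print(incomplete_items(customer_extract))
```

A completeness check of this kind is a cheap way to find gaps in the metadata trail before an extraction run depends on them.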


Possible ETT Failures

• A missing source file

• A system failure

• Inadequate metadata

• Poor mapping information

• Inadequate storage planning

• A source structural change

• No contingency plan

• Inadequate data validation


Possible ETT Failures

ETT processes are vital to the warehouse, and they must succeed. ETT may fail for any of the following reasons:

• Extraction routines must specify the name and location of the source data. A missing file may cause the extraction to fail. You must therefore ensure that exception and error handling routines are included.

• If there is a system or media failure during the process, the process may fail entirely. You must start again or you may, depending upon system settings, be able to continue from the point of failure.

• Metadata that inadequately describes the source to destination mapping and rules will cause ETT to fail; for example, when an unexpected value is found.

• Without the space for temporary data, staging data, and sorting operations, ETT fails.

• Any changes to the source systems that are not documented in metadata will cause extraction to fail.

• Contingency plans are needed, including mechanisms for correcting or reapplying processing.

• If data is not validated correctly, the quality of extraction and the success of transformation cannot be guaranteed. This translates to a data warehouse that may contain dirty data at the end of the load.
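Several of the failure causes above (a missing source file, an undocumented structural change, and unvalidated data) can be guarded against in the extraction routine itself. The following is a minimal sketch, assuming a comma-separated source file with a header row; the `ExtractionError` class, the `extract` function, and the file and column names are hypothetical.

```python
import csv
import os
import tempfile


class ExtractionError(Exception):
    """Raised when extraction cannot proceed at all."""


def extract(path, required_fields):
    """Read a source file, separating valid rows from rejected rows."""
    if not os.path.exists(path):
        raise ExtractionError(f"Source file not found: {path}")
    accepted, rejected = [], []
    with open(path, newline="") as source:
        reader = csv.DictReader(source)
        # A missing column signals a source structural change.
        missing = [f for f in required_fields if f not in (reader.fieldnames or [])]
        if missing:
            raise ExtractionError(f"Source structure changed; missing fields: {missing}")
        for row in reader:
            # A row is valid only if every required field has a value.
            if all((row[f] or "").strip() for f in required_fields):
                accepted.append(row)
            else:
                rejected.append(row)
    return accepted, rejected


with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "customers.csv")
    with open(source_path, "w", newline="") as handle:
        handle.write("CUSNUM,NAME\n90328575,Oracle Corp\n90233479,\n")
    accepted, rejected = extract(source_path, ["CUSNUM", "NAME"])
    try:
        extract(os.path.join(workdir, "missing.csv"), ["CUSNUM"])
        missing_file_error = None
    except ExtractionError as error:
        missing_file_error = str(error)

print(len(accepted), "accepted;", len(rejected), "rejected;", missing_file_error)
```

Rejected rows are kept rather than silently dropped so that a contingency process can correct and reapply them.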


Maintaining ETT Quality

• ETT must be:

– Tested

– Documented

– Monitored and reviewed

• Disparate metadata must be coordinated


Maintaining ETT Quality

Any failure of the ETT processes affects data quality, the importance of which must not be underestimated. Inaccurate data leads to inaccurate analysis results, which lead to bad business decisions. The result of poor data quality is a lack of confidence in the system to deliver the solution.

Testing the Process

You should test the proposed ETT techniques to ensure that volumes can be physically moved within the load window constraints and network capabilities.

Documenting the Process

You must communicate and document the proposed load processes with the operations organization to ensure their agreement and commitment to this important process.

Monitoring and Reviewing the Process

You should ensure that the load is constantly monitored and reviewed, and revise metrics where needed. Warehouse data volumes grow rapidly, and metrics for load and data granularity need regular revision. The grain of the warehouse affects query capabilities and the warehouse size.
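Checking that a volume can be moved within the load window is simple arithmetic: the load duration is the volume divided by the measured throughput, and it must not exceed the window. The sketch below uses invented figures (120 GB, a 4-hour window, 25 GB per hour measured) purely for illustration; real figures must come from testing on your own platform and network.

```python
def required_throughput(volume_gb, window_hours):
    """Throughput (GB per hour) needed to finish exactly within the window."""
    return volume_gb / window_hours


def load_duration_hours(volume_gb, measured_gb_per_hour):
    """Hours the load takes at the throughput measured during testing."""
    return volume_gb / measured_gb_per_hour


def fits_load_window(volume_gb, measured_gb_per_hour, window_hours):
    """True if the load completes within the available window."""
    return load_duration_hours(volume_gb, measured_gb_per_hour) <= window_hours


# Hypothetical figures for illustration only.
volume_gb = 120.0
window_hours = 4.0
measured_gb_per_hour = 25.0

needed = required_throughput(volume_gb, window_hours)
duration = load_duration_hours(volume_gb, measured_gb_per_hour)
fits = fits_load_window(volume_gb, measured_gb_per_hour, window_hours)
print(f"needed {needed} GB/h, load takes {duration:.1f} h, fits window: {fits}")
```

Because warehouse volumes grow, a check like this needs repeating with current figures rather than being run once at design time.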


Extraction Tools

[Figure: extraction tool screen titled “Map Source Data to Intermediate File Store”, with sample entries (Sales and Marketing; Customer Name; Char, Varchar; 20; Unique name) and the labels Mapping information, Update metadata, and JCL files]

Selection Criteria

• Base functionality

• Interface features

• Metadata repository

• Open API

• Metadata access

• Repository utilities

• Input and output processing

• Cleansing, reformatting, and auditing

• References

• Training requirements


Extraction Tools

Extraction tools normally have a GUI front end that allows you to enter the individual field mappings from source to target systems. The tools normally:

• Generate the required code for the mapping, whether COBOL, C, or any other language

• Create the necessary job control and scheduling files for the specific platform

• Create and manage changes to the metadata

Selection Criteria

The warehouse uses a host of different tools for extraction, modeling, management, and access. A tools selection committee must ensure that every tool selected meets identified requirements. This is usually a rigorous process.

If you decide to buy an extraction tool, consider the following fundamental issues:

• Base functionality

• Interface features and functionality

• The metadata repository and the attributes stored in the repository

• Open API

• Access to metadata by end users

• The effectiveness of the way that the tool presents the information

• Repository utilities, such as scheduling and name and address management

• Data extraction inputs and outputs

• Data cleansing, reformatting, and auditing features

Ask the tool vendor for customer references, so that you can ask those customers to describe their goals, successes, and failures with the product.

Consider the training required for the extraction tool. The complexity of the available extraction products varies, as does the ability of your staff. Training may be required for a few days or weeks.


WTI Partner ETT Tools

• Carleton

• Constellar

• Evolutionary Technologies

• Informatica

• Information Builders

• Oracle EDMS, Toolkits, OADW

• Prism Solutions

• Sagent

• Vality Technology


WTI Partner ETT Tools

The choice of ETT techniques and tools is often driven by the quality of the source data.

WTI Partner: Product

Carleton Corp: Carleton Passport, Carleton Passport Development Workbench

Constellar: Constellar Hub

Evolutionary Technologies: ETI Development Workbench, ETI Extract Tool Suite

Informatica Corporation: PowerMart (Designer, Server, and Manager)

Information Builders, Inc.: EDA Copy Manager

Oracle: EDMS (Extraction and Transformation Template), Toolkits, OADW

Prism Solutions, Inc.: Prism Change Manager, Prism Development Workbench, Prism Warehouse Manager

Sagent: Data Mart Suites

Vality Technology, Inc.: Integrity Data Re-engineering Tool


Summary

This lesson discussed the following topics:

• ETT processes are essential and consume a large proportion of warehouse resources and time

• The extraction process acquires source data

• You may encounter many data sources

• There are many data extraction issues

• ETT tools should be considered


Practice 10-1 Overview

This practice covers the following topics:

• Answering a series of short questions

• Specifying true or false to a series of statements


Practice 10-1

Please answer the following questions.

1 The acronym ETT stands for _________________________________________.

2 Name at least four potential sources of production data for the warehouse.

_____________________

_____________________

_____________________

_____________________

3 Name at least five potential sources of external data for the warehouse.

___________________________________________

___________________________________________

___________________________________________

___________________________________________

___________________________________________

4 Identify whether the following statements are true or false.

Question (answer True or False)

• Archive data is never used in a data warehouse; it is too old.

• External data is one of the easiest types of data to incorporate into the warehouse.

• Mapping data is a process whereby you eliminate data inconsistencies.

• Gateways are great mechanisms for transferring large volumes of data into the warehouse.

• Extraction tools are expensive.

• Transforming data occurs only in the staging area.


Lesson 11: Transforming Data


Overview

[Figure: course overview slide. Project Management (Methodology, Maintaining Metadata) spans the lesson blocks: Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Choosing a Computing Architecture; Modeling the Data Warehouse; Planning Warehouse Storage; ETT (Building the Warehouse); Meeting a Business Need; Supporting End User Access; Managing the Data Warehouse]


Overview

The last lesson introduced extraction, transformation, and transportation. The lesson then focused on extraction issues.

In this lesson, you explore how the transformation process transforms data from source systems into data suitable for end user query and analysis applications.

Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide on the facing page.

Objectives

At the end of this lesson, you should be able to:

• Explain the importance of quality data

• Define the term “transformation”

• Identify transformation issues

• Describe techniques for transforming data

• List tools that can be used to transform data


Importance of Data Quality

[Figure: browser views of customer data; labels include Summit Sports, Hollywood, and Speedy Pizza]

Benefits of Quality Data

• Clean data is essential for:

– Targeting customers

– Determining buying patterns

– Identifying householders: private and commercial

– Matching customers

– Identifying historical data

• Dirty data must be removed.


Importance of Data Quality

Importance of Quality Data

The importance of quality data in the data warehouse cannot be overemphasized. Data anomalies are bound to exist in source systems, but if they are allowed into the data warehouse they produce inaccurate information, which in turn leads to inaccurate reports and bad business decisions. The overall result is a lack of confidence in the system to deliver the solution and a data warehouse that either is not used or requires substantial improvement and management buy-in.

Quality data is the key to a successful warehouse; it is better to have no data at all than bad data.

Benefits of Quality Data

All dirty data must be eliminated from the staging area, so that you can query the warehouse to:

• Target the right audience for marketing communication

• Determine that a particular customer buys related products

• Determine that a group of people form a family, each of whom is a potential customer (householding)

• Identify that an organization is part of a larger enterprise (commercial householding)

• Identify that a customer is now part of another organization, because of an acquisition or takeover

• Match customers where there are many different records for the same customer. (For example, the different components of health care, such as the hospital, the pharmacy, and the doctor have their own records, or a patient may be treated by different physicians in the same hospital.)

• Identify the age of data and its history

Note: The terms scrubbing, cleaning, cleansing, and data reengineering are used interchangeably.


Standards

• Define a quality strategy

• Decide on optimal data-quality level


Quality Improvements

• Consider modifying rules for operational data

• Document the sources

• Create a data stewardship program

• Design the cleanup process carefully

• Initial cleanup and refresh routines may differ


Standards

A data-quality strategy must be defined early on in the development cycle. It is imperative that you have one in place.

The strategy defines the optimal level of data quality that provides the value required for the business. For example, there is little point in seeking a low data inconsistency rate at great expense if the benefit to the business is not tangible.

Improving Operational Data Quality

You may need to consider making changes over time to the operational system in order to improve the quality of data for the warehouse:

• Some of the validation and integrity rules that are applied to current operational data may need to be modified or enhanced.

• You may need to document previously undocumented sources, enlist the help of users who know the business data, and consider creating a “data stewardship” program.

• You should carefully examine the cleanup processes that you employ in transforming the extracted data.

• The initial data cleanup routines may be different from the routines applied to subsequent data refreshes.

Correcting data can be tedious, time-consuming, and expensive. Consider any modifications in a phased approach rather than fixing all problems in one attempt.


Guidelines

• Operational data should not be used directly in the warehouse

• Operational data must be cleaned for each increment

• Operational data is not simply fixed by modifying applications


Guidelines

Do not assume that data which suits you at the operational level is appropriate, suitable, and of sufficiently high quality for the data warehouse.

• The operational system contains no aging information.

• There are many examples of disparity in the data.

• There are many different meanings applied to data.

• Good operational data when merged may become poor data warehouse data.

Do not assume it is acceptable to clean up data after the pilot run of the first increment or implementation.

• The credibility of the data warehouse or data mart suffers.

• Postimplementation cleanups are more costly and the risk is higher than during the pilot run.

• The programs needed to handle the multitude of problems are very complex and would need to be rewritten after cleanup.

Do not assume that fixing applications at the point of entry (operational system) is going to satisfy quality and clean up the data for the future.

• It is often too time-consuming and costly to continually implement changes at that level.

• Changes cannot be implemented quickly enough to keep up with constantly changing operational requirements.

The cost in time and resources in reengineering the existing legacy data may be too high.


Solutions

• Conventional COBOL, 4GL

• Specialized tools

• Customized conversion process

• Business experts

Investigation, Conditioning, Standardization, Integration

Management

Poor data quality

• Own

• Take responsibility

• Resolve problems

• Data quality manager


Solutions

Use conventional COBOL or 4GL programs or purchase a specialized tool to capture and eradicate anomalies prior to data load. It is often very difficult to predict all possible variants.

You may consider designing a process in-house to assure the quality of the data entering the data warehouse. The process must involve:

• Data investigation: Parsing, lexical analysis, and pattern investigation

• Data conditioning and standardization: Moving the data into fixed fields, standardizing names and addresses

• Data integration: Building unique keys and integrating the data
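The conditioning, standardization, and integration steps above can be sketched with address strings like those in the Source Data Anomalies slide of this lesson. This is a deliberately crude illustration: the abbreviation rules, the `standardize_address` and `integrate` functions, and the choice to match on the street line alone are all assumptions. Specialized name and address tools apply far richer rules.

```python
import re

# Hypothetical token rules for standardizing street lines.
TOKEN_RULES = {"STREET": "ST", "ROAD": "RD", "FIRST": "1ST"}


def standardize_address(raw):
    """Reduce an address to a standardized street line."""
    text = raw.upper().replace(".", "")            # conditioning: case, punctuation
    text = re.sub(r"\bNORTH EAST\b", "NE", text)   # standardize the direction
    street = text.split(",")[0]                    # keep the street line only
    tokens = [TOKEN_RULES.get(token, token) for token in street.split()]
    return " ".join(tokens)


def integrate(records):
    """Group customer numbers under one match key per standardized address."""
    groups = {}
    for cusnum, name, address in records:
        groups.setdefault(standardize_address(address), []).append(cusnum)
    return groups


records = [
    ("90328575", "Oracle Corp", "100 NE 1st Street, Tampa"),
    ("90328575", "Oracle", "100 NE. First St., Tampa"),
    ("90238475", "Oracle Services", "100 North East 1st St., FLA"),
    ("90233479", "Oracle Limited", "100 N.E. 1st St."),
    ("90233489", "Oracle Computing", "15 Main Road, Ft. Lauderdale"),
    ("90345672", "Oracle Corp UK Ltd", "181 North Street, Key West, FLA"),
]

groups = integrate(records)
for match_key, cusnums in groups.items():
    print(match_key, cusnums)
```

Four differently written addresses collapse to one match key here, which is the point at which a unique key could be assigned. Whether records that share a street line really are the same customer is a business question, which is one reason to involve the business experts.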

You should involve the business experts in the entire warehouse ETT process.

Management

You must manage the quality of the data, processes, and rules, and put people in place to manage them. Someone must own, be directly responsible for, and resolve the issue of poor data quality. This person is often known as the data quality manager.

Note: At some sites there is a person or a group responsible for name and address management alone.


Transformation

Transformation eliminates operational data anomalies

• Cleans

• Standardizes

• Presents subject-oriented data

[Figure: data is extracted from the operational system, transformed (cleaned up, consolidated, restructured) in the data staging area, and transported (loaded) into the warehouse]


Source Data Anomalies

• No unique key

• Data naming and coding anomalies

• Data meaning anomalies between groups

• Spelling and text inconsistencies

CUSNUM    NAME                ADDRESS
90328575  Oracle Corp         100 NE 1st Street, Tampa
90328575  Oracle              100 NE. First St., Tampa
90238475  Oracle Services     100 North East 1st St., FLA
90233479  Oracle Limited      100 N.E. 1st St.
90233489  Oracle Computing    15 Main Road, Ft. Lauderdale
90234889  Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
90345672  Oracle Corp UK Ltd  181 North Street, Key West, FLA


Transformation

Transformation involves a number of tasks, the most important being to eliminate all anomalies. Cleaning also includes eliminating formatting differences, assigning data types, defining consistent units of measure, and determining encoded structures. Along with these tasks, another objective is to ensure that the data is presented in a subject-oriented fashion.

Reasons for Data Anomalies

One of the causes of inconsistencies within internal data is that in-house system development takes place over many years, often with different software and development standards for each implementation.

There may be no consistent policy for the software used in the corporate environment. Systems may be upgraded or changed over the years. Each system may represent data in different ways.

Source Data Anomalies

Many potential problems can exist with source data:

• No unique key for individual records

• Anomalies within data fields, such as differences between naming and coding (data type) conventions

• Differences in the interpreted meaning of the data by different user groups

• Spelling errors and other textual inconsistencies (this is particularly relevant in the area of customer names and addresses)


Transformation Routines

• Cleaning data

• Eliminating inconsistencies

• Adding elements

• Merging data

• Integrating data

• Transforming data before load


Transformation Routines

One reason for the inconsistencies in internal data is that in-house system development takes place over many years and often uses different software and standards for each implementation. Transformation routines address these inconsistencies and include:

• Cleaning the data, also referred to as data cleansing or scrubbing

• Adding an element of time to the data, if it does not already exist

• Translating the formats of external and purchased data into something meaningful for the warehouse

• Merging rows or records in files

• Integrating all the data into files and formats to be loaded into the warehouse

Transformation should be performed:

• Before the data is loaded into the warehouse

• In parallel (on larger databases, there is not enough time to run the transformation as a single-threaded process)

The transformation process should be self-documenting, should generate summary statistics, and should process exceptions.
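As a minimal sketch of a routine that generates summary statistics and processes exceptions, the following Python wraps a transform so that failing rows are collected rather than stopping the run. The names `TransformReport`, `run_transform`, and `to_amount` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TransformReport:
    """Summary statistics for one transformation run."""
    rows_read: int = 0
    rows_loaded: int = 0
    exceptions: list = field(default_factory=list)  # (row, reason) pairs


def run_transform(rows, transform):
    """Apply `transform` to each row; collect failures instead of stopping."""
    report = TransformReport()
    output = []
    for row in rows:
        report.rows_read += 1
        try:
            output.append(transform(row))
            report.rows_loaded += 1
        except ValueError as exc:
            report.exceptions.append((row, str(exc)))
    return output, report


def to_amount(row):
    """Hypothetical transform: convert a text amount to a number."""
    return {"amount": float(row["amount"])}


rows = [{"amount": "10.00"}, {"amount": "n/a"}, {"amount": "15.00"}]
clean, report = run_transform(rows, to_amount)
```

The report object can be written to a log after each run, which is one simple way to keep the process self-documenting.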


Transforming Data: Problems and Solutions

Multipart keys

Country code

Sales territory

Product number

Salesperson code

Product code = 12M65431345


Transforming Data: Problems and Solutions

Multipart Keys Problem

Many older operational systems used record key structures that had a built-in meaning. To allow for decision support reporting, these keys must be broken down into atomic values.

In the example, the key contains four atomic values.

Key Code: 12M65431345

Where:

12 is the country code

M is the sales territory

65431 is the product number

345 is the salesperson code

Solution

The program or tools you use must be capable of identifying on a character-by-character (or position-by-position) basis the individual values, length of value, and the meaning of the resulting information. In the example quoted it is important that the code can extract the M and know that this is a territory code that identifies “Midwest,” “Manchester,” or “Moscow.”

You may need to build a series of transforms to evaluate the results fully. For example, these steps may be appropriate:

1 Extract third character position.

2 Evaluate the character against a master lookup table.

3 Evaluate the meaning of M.

4 Store the meaning (Moscow) in a field for insertion into the data warehouse.
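The steps above can be sketched in Python as follows. The field positions follow the example key 12M65431345; the lookup table contents and the names `TERRITORY_LOOKUP`, `KeyParts`, and `split_multipart_key` are assumptions made for illustration.

```python
from typing import NamedTuple

# Hypothetical master lookup table; a real one would come from reference data.
TERRITORY_LOOKUP = {"M": "Moscow"}


class KeyParts(NamedTuple):
    country_code: str
    territory_code: str
    territory_name: str
    product: str
    salesperson: str


def split_multipart_key(key: str) -> KeyParts:
    """Break an 11-character key into atomic values by position."""
    if len(key) != 11:
        raise ValueError(f"expected 11 characters, got {len(key)}")
    territory_code = key[2]  # third character position
    if territory_code not in TERRITORY_LOOKUP:
        raise ValueError(f"unknown territory code {territory_code!r}")
    return KeyParts(
        country_code=key[0:2],
        territory_code=territory_code,
        territory_name=TERRITORY_LOOKUP[territory_code],
        product=key[3:8],
        salesperson=key[8:11],
    )


parts = split_multipart_key("12M65431345")
```

Unknown territory codes raise an error here so that they can be routed to exception handling rather than loaded silently.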


If field not in ('m', 1, 'male') then …

else if field is NULL then …

Transforming Data

• Multiple encoding

• Must pick up erroneous data

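[Figure: source systems encode the same attribute as m/f, 1/0, or male/female, and may contain erroneous entries such as mle or NULL; the target encoding shown is m/f]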


Transforming Data

• Multiple local standards

• Tools or filters to preprocess

[Figure: examples of multiple local standards: units (cm, inches), date formats (DD/MM/YY, MM/DD/YY, DD-Mon-YY), and currencies (1,000 GBP; FF 9,990; USD 600)]


Multiple Encoding Problem

Some systems may represent values in different ways.

For example, some systems may use M to denote “male” and F to denote “female”, while others use 1 and 0, or even NULL values.

Solution

The program must be capable of identifying all the distinct possibilities and must handle exceptions. For example, your program may consider that a male is encoded as M, 1, or Male, but fail to take into account spurious and bad entries such as Man, Mle, or N/A.

Your program must be capable of picking up the spurious and bad entries and changing the values to something appropriate, such as:

1 Select all M, 1, or Male.

2 Place all other records into a file for reprocessing.

3 Interpret records to be reprocessed and determine from other related values in the record whether the person is male or female.

4 Change the value accordingly, and reprocess the rows by selecting the newly marked records.
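A minimal Python sketch of the first two steps follows. The accepted codes come from the examples above; treating NULL (`None`) as ambiguous and routing it for reprocessing is an assumption of this sketch, and a source that deliberately uses NULL as a code would need its own mapping. The names `normalize_gender` and `split_records` are hypothetical.

```python
MALE_CODES = {"m", "1", "male"}
FEMALE_CODES = {"f", "0", "female"}


def normalize_gender(value):
    """Return 'M' or 'F', or None when the value needs reprocessing."""
    if value is None:
        return None
    token = str(value).strip().lower()
    if token in MALE_CODES:
        return "M"
    if token in FEMALE_CODES:
        return "F"
    return None


def split_records(records):
    """Separate records with a recognized code from those to reprocess."""
    clean, rejects = [], []
    for rec in records:
        code = normalize_gender(rec.get("gender"))
        if code is None:
            rejects.append(rec)  # step 2: set aside for reprocessing
        else:
            clean.append({**rec, "gender": code})
    return clean, rejects


records = [
    {"id": 1, "gender": "M"},
    {"id": 2, "gender": 1},
    {"id": 3, "gender": "Mle"},
    {"id": 4, "gender": None},
    {"id": 5, "gender": "female"},
]
clean, rejects = split_records(records)
```

Steps 3 and 4 depend on other values in the record and on business knowledge, so they are left out of the sketch.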

Multiple Local Standards Problem

This is particularly relevant for values entered in different countries.

For example, some countries use imperial measurements and others metric; currencies and date formats differ; currency values and character sets may vary; and numeric precision values may differ.

Currency values are often stored in two formats, a local currency such as sterling, French francs, or Australian dollars, and a global currency such as U.S. dollars.

Solution

Typically, you use tools or filters to preprocess this data into a suitable format for the database, with the logic needed to interpret and reconstitute a value. You might employ steps similar to those identified for multiple encoding.

You may consider revising source applications to eliminate these inconsistencies early on.
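As an illustration of such a preprocessing filter, the Python sketch below converts dates and lengths from two local conventions into one. Which source uses which date format, the source labels `uk` and `us`, and the choice of ISO dates and centimetres as targets are all assumptions made for this example.

```python
from datetime import datetime

# Assumed date format for each (hypothetical) source system.
SOURCE_DATE_FORMATS = {"uk": "%d/%m/%y", "us": "%m/%d/%y"}
CM_PER_INCH = 2.54


def to_iso_date(text: str, source: str) -> str:
    """Parse a date in the source's local format and return YYYY-MM-DD."""
    return datetime.strptime(text, SOURCE_DATE_FORMATS[source]).date().isoformat()


def to_cm(value: float, unit: str) -> float:
    """Convert a length to centimetres."""
    if unit == "cm":
        return value
    if unit == "inches":
        return value * CM_PER_INCH
    raise ValueError(f"unknown unit {unit!r}")


uk_date = to_iso_date("01/02/98", "uk")
us_date = to_iso_date("01/02/98", "us")
length_cm = to_cm(10, "inches")
```

The same text, 01/02/98, yields two different dates depending on the source, which is why the source of each value has to travel with it through the filter.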


Multiple Files Problem

• Added complexity of multiple source files

• Start simple

Extracted data

Multiple source files

Logic to detect correct source


Transforming Data from Multiple Files

[Chart: conflict and integration points (axis 0 to 16) plotted against sources to be incorporated (2 to 6 files)]


Multiple Files Problem

The source of information may be one file for one condition, and a set of files for another. Logic (normally procedural) must be in place to detect the right source.

The complexity of integrating data is greatly increased according to the number of data sources being integrated.

For example, if you are integrating data from two sources, there is a single point of integration where conflicts must be sorted. Integrate from three sources, and there are three points of conflict. Four sources provide six conflict points. In general, n sources give n(n − 1)/2 conflict points, so the problem grows quadratically rather than linearly with the number of sources.
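The counts in this example are the number of distinct pairs among the sources, which is n(n − 1)/2 for n sources. A short Python sketch that reproduces them (the file names are hypothetical):

```python
from itertools import combinations


def conflict_points(source_count: int) -> int:
    """Number of distinct source pairs, each a potential conflict point."""
    return source_count * (source_count - 1) // 2


# Enumerate the pairs explicitly for four hypothetical source files.
sources = ["file_a", "file_b", "file_c", "file_d"]
pairs = list(combinations(sources, 2))

counts = {n: conflict_points(n) for n in range(2, 7)}
```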

Solution

This is a complex problem that requires the use of tools or well-documented transformation mechanisms.

Try not to integrate all the sources in the first instance. Start with two or three and then enhance the program to incorporate more sources. Build on your learning experiences.


Missing Values Problem

Solution

• Ignore

• Wait

• Mark rows

• Extract when time-stamped

If NULL then field = 'A'


Duplicate Value Problem

Solution

• SQL self-join techniques

• RDBMS constraint utilities

[Figure: duplicate “ACME Inc” records]

SELECT …
FROM table_a, table_b
WHERE table_a.key (+) = table_b.key
UNION
SELECT …
FROM table_a, table_b
WHERE table_a.key = table_b.key (+)


Missing Values Problem

Null, missing, and default values are always an issue. NULL values may be valid entries where NULLs are allowed; otherwise, NULLs indicate missing values.

Solution

You must examine each occurrence of the condition to determine validity and decide whether these occurrences must be transformed; that is, identify whether a NULL is valid or invalid (missing data). You may choose to:

• Ignore the missing data. If the volume of records is relatively small, it may have little impact overall.

• Wait to extract the data until you are sure that missing values are entered from the operational system.

• Mark rows when extracted, so that on the next extract you can select only those rows not previously extracted. This approach involves the overhead of a SELECT and an UPDATE, and if the extracted data forms the basis of summary tables, those tables must be re-created.

• Extract data only when it is time-stamped as completed, rather than by business cycle.
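The "mark rows when extracted" option can be sketched with Python's built-in sqlite3 module standing in for the operational database. The `orders` table, its `extracted` flag column, and the function name are hypothetical; the point is only the SELECT followed by the UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, amount REAL, extracted INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)", [(1, 10.0), (2, 15.0)]
)


def extract_new_rows(conn):
    """Select rows not yet extracted, then mark them as extracted."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE extracted = 0 ORDER BY id"
    ).fetchall()
    conn.executemany(
        "UPDATE orders SET extracted = 1 WHERE id = ?", [(r[0],) for r in rows]
    )
    conn.commit()
    return rows


first_extract = extract_new_rows(conn)
conn.execute("INSERT INTO orders (id, amount) VALUES (3, 11.0)")
second_extract = extract_new_rows(conn)
```

The second call returns only the row added after the first extract, at the cost of one extra UPDATE per extracted row.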

Duplicate Value Problem

You need to eliminate duplicate values, which invariably exist. This can be time-consuming, although it is a simple task to perform.

Solution

You can use standard SQL self-join techniques or RDBMS constraint utilities to eliminate duplicates.
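A minimal sketch of the self-join technique, run through Python's sqlite3 module, is shown below. The `customer` table and its contents are hypothetical, and the sketch only catches exact duplicates of the name; near-duplicates such as spelling variants need the cleaning steps described elsewhere in this lesson.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "ACME Inc"), (2, "ACME Inc"), (3, "Oracle Corp"), (4, "ACME Inc")],
)

# Self-join: a row is a duplicate if another row has the same name
# and a lower id, so the lowest id for each name is kept.
duplicates = conn.execute(
    """
    SELECT DISTINCT b.id
    FROM customer a
    JOIN customer b ON a.name = b.name AND a.id < b.id
    ORDER BY b.id
    """
).fetchall()

conn.executemany("DELETE FROM customer WHERE id = ?", duplicates)
remaining = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
```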


Element Names Problem

[Figure: the same element is named Customer, Client, Contact, or Name in different sources]

Solution

• CTAS

• SQL*Loader


Element Meaning Problem

[Figure: Customer_detail is interpreted by different groups as the customer’s name, all customer details, or all details except name]

• Avoid misinterpretation

• Complex solution

• Document meaning in metadata


Element Names Problem

Individual attributes, columns, or fields may vary in their naming conventions from one source to another. These variations need to be eliminated to ensure that one naming convention is applied to the value in the warehouse.

If you are employing independent data marts, then you should ensure that the ETT solution is mirrored across them, so that if you later make the data marts dependent, they all refer to the same object.

Solution

You need to obtain agreement from all relevant user groups on renaming conventions, and rename the elements accordingly. Document the changes in metadata.

The programs you use determine the solution. For example, if you are using SQL CREATE TABLE AS SELECT (CTAS), the new column name is used in that statement. If you use SQL*Loader as an intermediary mechanism prior to load, you create your destination object with the agreed naming convention applied.
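To illustrate renaming through CTAS, the sketch below runs a CREATE TABLE AS SELECT statement through Python's sqlite3 module, which stands in for the warehouse database here. The table names and the old and new column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source table with locally chosen column names.
conn.execute("CREATE TABLE src_orders (client TEXT, amt REAL)")
conn.execute("INSERT INTO src_orders VALUES ('ACME Inc', 10.0)")

# CTAS: column aliases apply the agreed naming convention.
conn.execute(
    """
    CREATE TABLE stage_orders AS
    SELECT client AS customer_name, amt AS sales_dollars
    FROM src_orders
    """
)

# Column names of the new table (second field of each PRAGMA row).
columns = [row[1] for row in conn.execute("PRAGMA table_info(stage_orders)")]
rows = conn.execute("SELECT customer_name, sales_dollars FROM stage_orders").fetchall()
```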

Agreement on the name change and the meaning of the data can become a political issue between groups and departments in the organization.

Element Meaning Problem

Like the name of an element, the meaning is often interpreted differently by different user groups. The variations in naming conventions typically drive this misinterpretation. You need to keep your model independent of naming conventions that may be popular today, but subject to change.

Solution

It is a difficult problem, often political, but you must ensure that the meaning is clear. By documenting the meaning in metadata you can solve this problem, especially if the meaning is composed of several elements and algorithms have been used.

In order to take information from the operational system into the warehouse, you must know the meaning of the data. This may involve rebuilding the transaction from its component parts (which are likely in a normalized state). You must know the:

• Business rules

• Processes executed for a type of transaction, such as the tables that are updated

This is a complex task, which may involve merging or separating data components, extracting values from multipart keys, and much more.


Input Format Problem

ASCII

EBCDIC

12373

“123-73”

ACME Co.

ברוכים הבאים

Beer (Pack of 8)


Referential Integrity Problem

Solution

• SQL anti-join

• Server constraints

• Dedicated tools

Department
10
20
30
40

Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50
6786  Harris  60


Input Format Problem

Input formats vary considerably.

For example, one entry may accept alphanumeric data, so the format may be “123-73”. Another entry may accept numeric data only, so the format may be “12373”.

You may also need to convert between ASCII and EBCDIC, or even convert complex character sets such as Hebrew, Arabic, or Japanese.

Solution

First, ensure that you document the original and the resulting formats.

Your program (or tool) must then convert those data types either dynamically or through a series of transforms into one acceptable format.

You can use Oracle SQL*Loader to perform certain transformations, such as EBCDIC to ASCII conversions and assigning values to default or NULL values.
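Outside SQL*Loader, the same kinds of conversion can be sketched in a few lines of Python. The choice of EBCDIC code page 037 is an assumption about the source system, and the function names are hypothetical.

```python
import re


def ebcdic_to_text(raw: bytes) -> str:
    """Decode EBCDIC bytes; code page 037 is an assumed source encoding."""
    return raw.decode("cp037")


def normalize_code(value: str) -> str:
    """Strip non-digits so that "123-73" and "12373" share one format."""
    digits = re.sub(r"\D", "", value)
    if not digits:
        raise ValueError(f"no digits found in {value!r}")
    return digits


# Round trip: encode a name as EBCDIC, then decode it back to text.
name = ebcdic_to_text("ACME Co.".encode("cp037"))
code_a = normalize_code("123-73")
code_b = normalize_code("12373")
```

Both the original format and the resulting format should still be recorded in metadata, as the solution above states.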

Referential Integrity Problem

If the constraints at the application or database level have not been enforced accurately in the past, child and parent record relationships can suffer, and orphaned records can exist.

You must understand data relationships built into legacy systems. The biggest problem encountered here is that they are often undocumented. You must gain the support of users and technicians to help you with analysis and documentation of the source data.

Solution

This is a simple cleaning task, but it is time-consuming and requires business experience to resolve the inconsistencies. You can use SQL anti-join query techniques, server constraint utilities, or dedicated tools to eliminate these inconsistencies.
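An anti-join can be sketched with the employee and department values from the slide for this topic, using Python's sqlite3 module as a stand-in database. The table and column names are hypothetical; NOT EXISTS is one common way to write an anti-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE employee (emp_id INTEGER, name TEXT, dept_id INTEGER)")
conn.executemany("INSERT INTO department VALUES (?)", [(10,), (20,), (30,), (40,)])
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1099, "Smith", 10), (1289, "Jones", 20), (1234, "Doe", 50), (6786, "Harris", 60)],
)

# Anti-join: employees whose department has no matching parent row.
orphans = conn.execute(
    """
    SELECT e.emp_id, e.name, e.dept_id
    FROM employee e
    WHERE NOT EXISTS (
        SELECT 1 FROM department d WHERE d.dept_id = e.dept_id
    )
    ORDER BY e.emp_id
    """
).fetchall()
```

The query only finds the orphaned records; deciding whether to correct, reassign, or reject them is the part that needs business experience.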


Name and Address Problem

• No unique key

• Missing values

• Personal and commercial names mixed

• Different addresses for same member

• Different names and spelling for same member

• Many names on one line

• One name on two lines

Database 1

NAME                  LOCATION
DIANNE ZIEFELD        N100
HARRY H. ENFIELD      D589
FRED AND SARA MULLEN  M300

Database 2

NAME                   LOCATION
ZIEFLED, DIANNE        100
ENFIELD, HARRY H       589
MULLEN, SARA AND FRED  300


Name and Address Problem

• Single-field format

• Multiple-field format

Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565

Name    Mr. J. Smith
Street  100 Main St.
Town    Bigtown
County  County Luth
Code    23565


Name and Address Problem

One of the largest areas of concern, with regard to data quality, is how name and address information is held, and how to transform it. Name and address information has historically suffered from a lack of legacy standards. This information has been stored in many different formats, sometimes dependent upon the software or even the data processing center used.

Usual Inconsistencies

Some of the following data inconsistencies may appear:

• No unique key

• Missing data values (NULLs)

• Personal and commercial names mixed

• Different addresses for same member

• Different names and spelling for same member

• Many names on one line

• One name on two lines

• The data may be in a single field of no fixed format:

Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565

Each component of an address may be in a specific field:

Mr. J. Smith

100 Main St.

Bigtown

County Luth

23565


Clean and Organize

1. Create atomic values.

2. Standardize formats.

3. Verify data accuracy.

4. Match with other records.

5. Identify private and commercial addresses and inhabitants.

6. Document in metadata.

Requires sophisticated tools and techniques


Name and Address Problem (continued)

Solution

Name and address cleanup involves a series of complex processes that decompose and reassemble data. It can be broken down into a number of steps; those identified here represent just one example.

Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565

Steps to Clean and Organize

1 Break the record down into atomic values, each of which has a description.

2 Ensure that all elements appear in a standard format, so that St. in this example becomes Street. This element needs to be recoded, as do other similar elements, such as Rd and Cres.

3 Verify the accuracy of standard elements using data from external sources.

– Is Bigtown actually associated with this postal code?

– Is Bigtown in County Luth?

– Is County Luth associated with this postal code?

4 Check whether there are any other customers with the name Smith. If there are, verify whether the addresses are identical; if they are not, then one is probably the current address and others are old addresses. You probably have to refer to external data to check this. Mark records with notes such as previous and current.

5 Identify whether there is more than one customer record for any given address. You may find a Smith, and a Doe, and a Jones all at 100 Main Street. Are they all resident in the same house or apartment?

6 Document the results of these steps in metadata.

You can see from the complexity of even this simple example that this cleanup requires sophisticated software techniques, tools, or expert knowledge in coding the algorithms required to perform each step.

Value          Description
Title          Mr.
First Initial  J
Last Name      Smith
House Number   100
....           ....
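Steps 1 and 2 can be sketched in Python for the single example record used above. The sketch assumes a comma-separated record with exactly five parts and a three-word name, which real data will often violate; the recoding table and the function name `parse_address` are hypothetical.

```python
import re

# Small illustrative recoding table; a production list would be far longer.
STREET_TYPES = {"St.": "Street", "Rd": "Road", "Cres": "Crescent"}


def parse_address(line: str) -> dict:
    """Split one record into atomic values and standardize the street type."""
    # Raises ValueError if the record does not have exactly five parts.
    name, street, town, county, code = [part.strip() for part in line.split(",")]
    title, initial, last_name = name.split()
    match = re.match(r"(\d+)\s+(.*)", street)
    if match is None:
        raise ValueError(f"no house number in {street!r}")
    words = [STREET_TYPES.get(word, word) for word in match.group(2).split()]
    return {
        "title": title,
        "first_initial": initial.rstrip("."),
        "last_name": last_name,
        "house_number": match.group(1),
        "street": " ".join(words),
        "town": town,
        "county": county,
        "postal_code": code,
    }


record = parse_address("Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565")
```

Records that do not fit the assumed layout raise an error, so they can be set aside for manual review rather than parsed incorrectly. Steps 3 to 5 need external reference data and are not attempted here.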


Merging Data

• Operational transactions do not usually map one-to-one with warehouse data

• Data for the warehouse is merged to provide information for analysis

Sale 1/2/98 12:00:01 Ham Pizza $10.00

Sale 1/2/98 12:00:02 Cheese Pizza $15.00

Sale 1/2/98 12:00:02 Anchovy Pizza $12.00

Return 1/2/98 12:00:03 Anchovy Pizza - $12.00

Sale 1/2/98 12:00:04 Sausage Pizza $11.00

Pizza sales/returns by day, hour, seconds


Merging Data

Sale 1/2/98 12:00:01 Ham Pizza $10.00

Sale 1/2/98 12:00:02 Cheese Pizza $15.00

Sale 1/2/98 12:00:04 Sausage Pizza $11.00

Sale 1/2/98 12:00:02 Anchovy Pizza $12.00

Return 1/2/98 12:00:03 Anchovy Pizza - $12.00

Sale 1/2/98 12:00:01 Ham Pizza $10.00

Sale 1/2/98 12:00:02 Cheese Pizza $15.00

Sale 1/2/98 12:00:04 Sausage Pizza $11.00


Transformation Techniques

Merging Data

An operational transaction does not usually have a one-to-one mapping with data in the warehouse, even if the data in the warehouse is maintained at the transaction level.

For example, consider a sales transaction in a store. The logical transaction comprises a number of components such as date of sale, charge amount, number of items, discount amount, and payment method. The transaction may even be a return.

A customer purchase and a customer return are very different types of sales transactions, and different business rules must apply. For each different transaction a different process occurs. A purchase depletes inventory and a return adds stock back into inventory.

As a result, the warehouse holds data purely for reporting purposes, and these transactions are merged into data that is useful for that purpose. In the end, the data does not map strictly to sales or returns.
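Using the pizza transactions from the slides, one possible merge rule can be sketched in Python: each return cancels the earliest earlier sale of the same item and amount. This rule is an assumption made for illustration; the actual business rules for a given warehouse may differ.

```python
from collections import namedtuple

Txn = namedtuple("Txn", "kind date time item amount")

transactions = [
    Txn("Sale", "1/2/98", "12:00:01", "Ham Pizza", 10.00),
    Txn("Sale", "1/2/98", "12:00:02", "Cheese Pizza", 15.00),
    Txn("Sale", "1/2/98", "12:00:02", "Anchovy Pizza", 12.00),
    Txn("Return", "1/2/98", "12:00:03", "Anchovy Pizza", -12.00),
    Txn("Sale", "1/2/98", "12:00:04", "Sausage Pizza", 11.00),
]


def merge_returns(txns):
    """Drop each return together with the earliest kept sale it reverses."""
    kept = []
    for txn in txns:
        if txn.kind != "Return":
            kept.append(txn)
            continue
        for i, sale in enumerate(kept):
            if sale.item == txn.item and sale.amount == -txn.amount:
                del kept[i]
                break
        else:
            kept.append(txn)  # no matching sale found: keep for review
    return kept


merged = merge_returns(transactions)
net_sales = sum(txn.amount for txn in merged)
```

The merged result contains the ham, cheese, and sausage sales only, matching the three warehouse rows shown on the slide.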


Adding a Date Stamp

• Enables time analysis

• Label loaded data with a date stamp

• Add time to fact and dimension data


Adding a Date Stamp

Item Table: Item_id, Dept_id, Time_key

Time Table: Week_id, Period_id, Year_id, Time_key

Store Table: Store_id, District_id, Time_key

Product Table: Product_id, Time_key, Product_desc

Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units


Adding a Date Stamp

Time is important within the data warehouse. You have already looked at the time dimension, which is always created in the warehouse in order to provide reporting by time periods.

Extracted source data probably does not contain time information, because time-stamping is not typical in operational systems (unless of course they too are maintaining history, or time is a critical component). More likely the record in the operational system has a value associated with it, such as Order_date, Ship_date, or Call_date.

Therefore it is important to consider how you are going to add a time element to your warehouse data. This is particularly important for two areas of the warehouse:

• Fact tables that hold vast amounts of data used to analyze the business according to time periods

• Dimension data containing criteria by which you perform the analysis

You need to consider how to manage time for both of these areas, in slightly different ways.


Adding a Date Stamp

• Fact table

– Add triggers

– Recode applications

– Compare tables

• Dimension table

• Time representation

– Point in time

– Time span


Adding a Date Stamp (continued)

Fact Table Data

Imagine that you need to add the next set of records from the source systems to your fact table. You need to determine which records are to be moved into the fact table. You have added data for March 1998. Now you need to add data for April 1998. You need to find a mechanism to stamp records so that you pick up only April 1998 records for the next refresh.

You might choose from a number of techniques:

• Code applications or database triggers at the operational level to time-stamp data, which can then be extracted using date selection criteria.

• Perform a comparison of tables, original and new, to identify differences.

• Maintain a table containing copies of changed records to be loaded.
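The table comparison technique can be sketched in Python by holding each snapshot as a dictionary keyed by primary key. The snapshot contents and the function name `diff_snapshots` are hypothetical; a real comparison would run against the tables themselves.

```python
def diff_snapshots(original, new):
    """Compare two snapshots keyed by primary key and return the differences."""
    inserted = {k: v for k, v in new.items() if k not in original}
    updated = {k: v for k, v in new.items() if k in original and original[k] != v}
    deleted = {k: v for k, v in original.items() if k not in new}
    return inserted, updated, deleted


# Hypothetical snapshots of the same source table at two refresh points.
march = {1: ("Ham Pizza", 10.00), 2: ("Cheese Pizza", 15.00)}
april = {1: ("Ham Pizza", 10.50), 3: ("Sausage Pizza", 11.00)}

inserted, updated, deleted = diff_snapshots(march, april)
```

Only the inserted and updated rows need to be carried forward to the next load; how deleted rows are treated depends on whether the warehouse keeps history.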

You must decide which are the best techniques for you to use according to your current system implementations. These are discussed in greater detail later in the course.

Dimension Table Data

Dimensions also change, and there are many different techniques you can employ to trap the changes. Some of these were identified earlier for fact tables.

Time Representation

The time may be represented as:

• A single point-in-time date

• A date range (start and end date)

The time element must either be available in the data before loading into the warehouse, or added when loading the data.
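The two representations can be illustrated as follows. This is a sketch only: the field names, the prices, and the convention that an end date of `None` marks the currently valid row are assumptions made for the example.

```python
from datetime import date

# Point in time: the row carries a single date.
sale = {"product": "Ham Pizza", "amount": 10.0, "sale_date": date(1998, 1, 2)}

# Time span: each row carries a start date and an end date.
# Assumption: an end date of None marks the row that is currently valid.
price_history = [
    {"product": "Ham Pizza", "price": 9.0,
     "start": date(1997, 1, 1), "end": date(1997, 12, 31)},
    {"product": "Ham Pizza", "price": 10.0,
     "start": date(1998, 1, 1), "end": None},
]


def row_valid_on(rows, product, on_date):
    """Return the time-span row for `product` that covers `on_date`, or None."""
    for row in rows:
        if row["product"] != product:
            continue
        if row["start"] <= on_date and (row["end"] is None or on_date <= row["end"]):
            return row
    return None


# Find the dimension row that was valid when the sale took place.
price_at_sale = row_valid_on(price_history, "Ham Pizza", sale["sale_date"])
```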


Adding Keys to Data

Operational records:

#1 Sale 1/2/98 12:00:01 Ham Pizza $10.00

#2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00

#3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00

#4 Return 1/2/98 12:00:03 Anchovy Pizza -$12.00

#5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00

Warehouse records:

#dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00

#dw2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00

#dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00

Data values or artificial keys


Adding Keys to Data

You are moving the data from one structure, with its keys defining relationships, into another that is totally different and must also have keys defining relationships.

The transformation of this data also includes adding keys (generalized or artificial) or creating keys from existing data values.

Note: Creating keys is discussed in more detail later in the course.
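As a rough illustration of artificial keys, the sketch below reproduces the pizza example from the slide. The slide appears to show the anchovy sale and its return canceling each other before keys are assigned; that netting rule, the tuple layout, and the function names are assumptions made for this example, not a method the course prescribes.

```python
from itertools import count

# Source records as shown on the slide: (key, type, time, product, amount).
source = [
    (1, "Sale", "12:00:01", "Ham Pizza", 10.00),
    (2, "Sale", "12:00:02", "Cheese Pizza", 15.00),
    (3, "Sale", "12:00:02", "Anchovy Pizza", 12.00),
    (4, "Return", "12:00:03", "Anchovy Pizza", -12.00),
    (5, "Sale", "12:00:04", "Sausage Pizza", 11.00),
]


def drop_returned_sales(rows):
    """Cancel each return against one earlier matching sale (assumed rule)."""
    kept = []
    for row in rows:
        _, kind, _, product, amount = row
        if kind == "Return":
            for i, earlier in enumerate(kept):
                if (earlier[1] == "Sale" and earlier[3] == product
                        and earlier[4] == -amount):
                    del kept[i]  # the sale and its return cancel out
                    break
            else:
                kept.append(row)  # no matching sale found: keep the return
            continue
        kept.append(row)
    return kept


def add_artificial_keys(rows, prefix="dw"):
    """Replace the source key with a generated warehouse key: dw1, dw2, ..."""
    seq = count(1)
    return [(f"{prefix}{next(seq)}",) + row[1:] for row in rows]


warehouse_rows = add_artificial_keys(drop_returned_sales(source))
```

The generated keys carry no meaning of their own, so they remain stable even if the source systems later renumber or reuse their keys.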


Summarizing Data

• During extraction on staging area

• After loading onto the warehouse server

[Figure: Operational databases, staging area, warehouse database]


Creating Summary Data

Creating summary data is essential for a data warehouse to perform well. Here it is classified under transformation only because you are changing the way the data exists in the source system into something else for the data warehouse.

In reality, the summary data is usually created on the warehouse server after transformation.

Summarizing Data

You can summarize the data:

• At the time of extraction in batch routines.

This reduces the amount of work performed by the data warehouse server, as all the effort is concentrated on the source systems. However, summarizing at this time increases:

– The complexity and time taken to perform the extract

– The number of files created

– The number of load routines

– The complexity of the scheduling process

• After the data is loaded into the warehouse database.

The process queries the fact data, summarizes it, and places it into the requisite summary fact table. This method reduces the complexity and time taken for the extract tasks. However, it places all the CPU and I/O intensive work on the warehouse server, thus increasing the time that the warehouse is unavailable to the users.

You should weigh the benefits of each method and determine your strategy according to your requirements and resources.
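The second method, summarizing after the load, can be sketched as follows. The example uses Python's built-in `sqlite3` module as a stand-in for the warehouse database; the table names, column names, and sample rows are assumptions made for the illustration.

```python
import sqlite3

# In-memory database standing in for the warehouse server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("1998-03-30", "Ham Pizza", 10.0),
        ("1998-04-01", "Ham Pizza", 10.0),
        ("1998-04-01", "Cheese Pizza", 15.0),
        ("1998-04-15", "Ham Pizza", 10.0),
    ],
)

# Query the fact data, summarize it by month and product, and place the
# result in a summary fact table.
conn.execute(
    """
    CREATE TABLE sales_by_month AS
    SELECT substr(sale_date, 1, 7) AS sale_month,
           product,
           SUM(amount) AS total_amount,
           COUNT(*)    AS sale_count
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7), product
    """
)

monthly = conn.execute(
    "SELECT sale_month, product, total_amount, sale_count "
    "FROM sales_by_month ORDER BY sale_month, product"
).fetchall()
```

Queries that need only monthly totals can then read `sales_by_month` instead of scanning the detailed fact table. The cost is the CPU and I/O spent on the warehouse server while the summary is built.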


Maintaining Transformation Metadata

Contains transformation rules, algorithms, and routines

[Figure: Sources, Stage, Rules, Publish; Extract, Transform, Load, Query]

Maintaining Transformation Metadata

• Key restructuring

• Coding differences

• Multiple sources

• Exception rules

• Format differences

• Referential integrity fixes

• Aggregated data


Maintaining Transformation Metadata

As with the extraction process, metadata must be maintained for the transformation process. This metadata includes:

• Information on how to perform key restructuring

• Logic to eliminate different coding methods and data values, parsing rules

• Logic to detect multiple source files

• Logic and exception rules to handle NULL, negative values, and default values and to eliminate and consolidate duplicate values

• Element renaming conventions

• Granularity conversions, input or language formats, conversion algorithms, and data standardization rules

• Referential integrity fixes

• Logic and program names used to create summary data

• Transformation frequency, program name, location, failure procedures, and validation

• Temporary extraction storage location, name, and source contact

The metadata also contains information about the frequency of program execution. Data repair usually involves using simple algorithms or more complex artificial intelligence programs to correct data.

Note: There is a lesson dedicated to metadata later in the course.


Data Ownership and Responsibilities

• Operational and application development teams

• Data warehouse development team

• Business benefit gained with a one-team approach


Data Ownership and Responsibilities

Ownership

The data extracted from the source systems is often under the control and ownership of application development teams who have been working with the operational data since its inception. The loading of the data into the warehouse is usually under the control of the data warehousing development team.

This raises the question of who is responsible for the transformation of the data: the process between extracting the data and loading it into the warehouse.

Working as One Team

The two teams, those responsible for operational data and those responsible for warehouse data, must work together. Doing so brings all the required knowledge together and produces the best solution. Working together also enhances understanding, knowledge, and teamwork, and levels the roles within the groups.

• The operational team may be critical to ensuring the success of the data extraction and providing the data warehouse team with extract files in requisite formats (for example C, COBOL, PL/SQL).

• The data warehouse team can then take on the task of making sure the extracted data is accurate and of sufficiently high quality for the warehouse.

If there is a need to reconsider how the operational data is entered (stored at the database level), to improve the ease of creating extracts and the quality of extract data, then teamwork and understanding of each other’s areas are critical.


Transformation Timing and Location

• Transformation is performed:

– Before load

– In parallel

• May be initiated at different points

[Figure: three candidate transformation points, labeled Unlikely, Probable, and Possible]


Transformation Points

You need to consider carefully when and where you perform transformation. You must perform transformation before the data is loaded into the warehouse, and in parallel; on larger databases, there is not enough time to perform this process as a single-threaded process.
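The idea of transforming in parallel rather than in a single thread can be sketched as follows. The partitioning scheme, the sample codes, and the trivial clean-up rule are assumptions made for the example. A thread pool is used only to keep the sketch short and self-contained; for CPU-intensive work you would more likely rely on separate processes or on the parallel features of the database or transformation tool.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_partition(codes):
    """Hypothetical transformation: trim and uppercase each product code."""
    return [code.strip().upper() for code in codes]


def partition(rows, n_parts):
    """Split rows into at most n_parts contiguous slices of near-equal size."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


raw_codes = [" 12m65421", "12M65431 ", "12m65421", " 12M65431"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # map() returns results in partition order, so output lines up with input.
    transformed_parts = list(pool.map(transform_partition, partition(raw_codes, 2)))

# Reassemble the partitions into one list ready for loading.
transformed = [code for part in transformed_parts for code in part]
```

Because each partition is transformed independently, the elapsed time depends on the slowest partition rather than on the total volume, provided the platform has the capacity to run the partitions concurrently.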

Consider the different places and points in time where transformation may take place.

On the Operational Platform

This approach transforms the data on the operational platform, where the source data resides.

The negative impact of this approach is that the transformation operation conflicts with the day-to-day working of the operational system.

If it is chosen, the process should be executed when the operational system is idle or less utilized. The impact of this approach is so great that it is very unlikely to be employed.

In a Separate Staging Area

This approach transforms data on a separate computing environment, the staging area, where summary data may also be created.

This is a common approach because it does not affect either the operational or warehouse environment. Cleaning, merging, and removal of anomalies are handled in the staging area, and summary creation may take place:

• On the staging server

• On the warehouse server

On the Warehouse Server

You may consider performing transformations on the warehouse server itself. However, this may affect the effectiveness of the server for query access.

It is more likely that you transform away from the warehouse server.


Choosing a Transformation Point

• Workload

• Environment impact

• CPU use

• Disk space

• Network bandwidth

• Parallel execution

• Load window time

• User information needs


Monitoring and Tracking

Transforms should:

• Be self-documenting

• Provide summary statistics

• Handle process exceptions


Choosing a Transformation Point

The approach you choose depends upon operational requirements. You must balance many different factors in order to determine the best solution. Consider:

• The actual workload (time to complete) of the transformations needed to provide the data for the warehouse

• The physical impact on each of the environments you might choose. (This is particularly relevant if you choose to use the operational platform.)

• The available CPU and disk space (for temporary and intermediate data and file store) on each environment

• The available network bandwidth between environments, which affects transfer volumes

• Whether the environment is capable of working in a parallel manner

• The load window time constraints

• The information needs of the business user. (When do they need this data? How often do refreshes occur?)

Monitoring and Tracking

The transformations should be self-documenting, should generate summary statistics, and should be able to process exceptions.
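One way to picture these three properties is a small wrapper that runs a transformation rule over each row, counts what happened, and sets aside the rows it could not process. This is a sketch under stated assumptions: the function names, the row layout, and the validation rule are all hypothetical.

```python
def run_transform(rows, transform):
    """Apply `transform` to each row, collecting statistics and rejected rows."""
    stats = {"read": 0, "written": 0, "rejected": 0}
    output, rejects = [], []
    for row in rows:
        stats["read"] += 1
        try:
            output.append(transform(row))
            stats["written"] += 1
        except (KeyError, ValueError) as exc:
            # Process the exception: record the row and the reason, then carry on.
            rejects.append({"row": row, "reason": f"{type(exc).__name__}: {exc}"})
            stats["rejected"] += 1
    return output, rejects, stats


def to_amount(row):
    """Hypothetical rule: the amount must be present, numeric, and non-negative."""
    amount = float(row["amount"])
    if amount < 0:
        raise ValueError("negative amount")
    return {"id": row["id"], "amount": amount}


rows = [
    {"id": 1, "amount": "10.00"},
    {"id": 2, "amount": "abc"},   # not numeric
    {"id": 3, "amount": "-5"},    # negative
    {"id": 4},                    # amount missing
]
loaded, rejected, summary = run_transform(rows, to_amount)
```

The docstrings document each rule, `summary` holds the row counts for the run, and `rejected` keeps every failed row with the reason for later review.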


Designing Transformation Processes

• Analysis:

– Sources and target mappings, business rules

– Key users, metadata, grain

• Design options: PL/SQL, replication, custom, third-party tools

• Design issues:

– Performance

– Size of the staging area

– Exception handling, integrity maintenance


Designing Transformation Processes

When designing your transformation processes, consider the analysis issues, the design options available to you, and the design issues.

Analysis

• Source and target mappings

• Business rules

• Key users

• Metadata

• Granularity of the fact data and summaries

Design Options

• PL/SQL

• Replication

• Custom 3GL programs

• Third-party tools

Design Issues

• Performance and throughput

• Sizing the staging areas to hold the data to be loaded into the warehouse

• Exception handling

• Integrity maintenance


Transformation Tools

• Purchased

• SQL*Loader

• In-house developed


Transformation Tools

Many of the purchased transformation tools perform extraction as well. The choice of transformation tool may already have been decided when you chose the extraction tool. However, transformation can be performed by:

• Tools purchased from specialized vendors, both third parties and Oracle

• SQL*Loader. This is an Oracle product that is commonly used to transport large volumes of data into the warehouse tables. It can also provide you with simple data transformations, such as multiple records becoming a single record, or conversely a single record at source becoming multiple records for the data warehouse.

• In-house developed programs and procedures using 3GL products such as C, C++, COBOL, or 4GL products such as SQL and PL/SQL. The DECODE SQL function can be used to test a value and change it to another value. For example, change “M” and “F” to Male and Female.

DECODE is fast because it is evaluated inside a set-based SQL statement and so takes advantage of parallel processing. You should be aware that PL/SQL does not take advantage of parallel processing capabilities and is slower than DECODE because it processes data row by row.
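The "M" and "F" example can be sketched as a single set-based statement. The sketch below runs in SQLite through Python's built-in `sqlite3` module, which has no DECODE function, so the same mapping is written as a CASE expression; the Oracle form is shown only in a comment. The table name, the column names, and the 'Unknown' default are assumptions made for the illustration.

```python
import sqlite3

# In-memory database standing in for a staging table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_stage (customer_id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO customer_stage VALUES (?, ?)",
    [(1, "M"), (2, "F"), (3, "m"), (4, None)],
)

# Oracle form (not executed here):
#   SELECT customer_id,
#          DECODE(UPPER(gender), 'M', 'Male', 'F', 'Female', 'Unknown')
#   FROM customer_stage
# SQLite equivalent, using CASE in place of DECODE. One statement maps
# every row; there is no row-by-row loop in the calling program.
decoded = conn.execute(
    """
    SELECT customer_id,
           CASE upper(gender)
               WHEN 'M' THEN 'Male'
               WHEN 'F' THEN 'Female'
               ELSE 'Unknown'
           END AS gender_name
    FROM customer_stage
    ORDER BY customer_id
    """
).fetchall()
```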


Data Management, Quality, and Auditing Tools

• Data management:

– Innovative Systems

– Postalsoft

– Vality Technology

• Data quality and auditing:

– Innovative Systems

– Vality Technology


Data Management, Quality and Auditing Tools

Management Tools

WTI Partner               Product
Innovative Systems, Inc.  Innovative Warehouse
Postalsoft, Inc.          Address Correction and Encoding (ACE)
Vality Technology, Inc.   Integrity Data Re-engineering Tool

Quality and Auditing Tools

WTI Partner               Product
Innovative Systems, Inc.  ISI Analyzer System
Vality Technology, Inc.   Integrity Data Re-engineering Tool


Summary

This lesson discussed the following topics:

• Importance of data quality

• Transformation process

• Data transformation issues

• Data anomalies

• Name and address management

• Tools


Summary

This lesson addressed the following topics:

• The importance of data quality in the warehouse

• The transformation process

• Transformation issues

• Anomalies that may exist in legacy systems

• Name and address management

• Tools available for extraction, transformation, and data quality


Practice 11-1 Overview

This practice covers the following topics:

• Answering a series of short questions

• Specifying true or false to a series of statements


Practice 11-1

1 Dirty data must be eliminated for the data warehouse. Name three alternative and common terms used to describe the process of eliminating anomalies in data.

_____________________

_____________________

_____________________

2 Name at least five problems associated with source data that must be eliminated for the data warehouse.

___________________________________________

___________________________________________

___________________________________________

___________________________________________

___________________________________________

3 Identify whether the following statements are true or false.

Question                                                                            True  False
It is considered impractical to eliminate data anomalies after the pilot run.
You need to consider adding time keys to warehouse data.
Transformation can be performed before or after data is loaded into the warehouse.


Lesson 12: Transportation: Loading Warehouse Data


Overview

Project Management (Methodology, Maintaining Metadata)

• Defining DW Concepts & Terminology

• Planning for a Successful Warehouse

• Analyzing User Query Needs

• Choosing a Computing Architecture

• Modeling the Data Warehouse

• Planning Warehouse Storage

• ETT (Building the Warehouse)

• Meeting a Business Need

• Supporting End User Access

• Managing the Data Warehouse

Objectives

After completing this lesson, you should be able to do the following:

• Explain key concepts in transporting data into the warehouse

• Outline how to build the transportation process for first-time load

• Identify transportation techniques

• Identify the tasks that take place after data is loaded

• Explain the issues involved in designing the transportation, loading, and scheduling processes


Overview

In the last two lessons, you examined extraction and transformation issues.

In this lesson, you examine how the extracted and transformed data is transported into the warehouse as the first-time loading of data.

Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide on the facing page.

Objectives

At the end of this lesson, you should be able to:

• Explain key concepts in transporting data into the warehouse

• Outline how to build the transportation process for the first-time load

• Identify transportation techniques

• Identify the tasks that take place after data is loaded

• Explain the issues involved in designing the transportation, loading, and scheduling processes


Transporting Data into the Warehouse

• Loading moves the data into the warehouse

• Loading can be time-consuming:

– Consider the load window.

– Schedule the task; automate all processes.

• Initial load moves large volumes

• Subsequent refresh moves smaller volumes

• Business determines the cycle

[Figure: Operational System, Data Staging Area, Warehouse; Extract, Transform, Transport (load)]


Transporting Data into the Warehouse

Transportation Tasks

The transportation process moves data from source data stores or an intermediate staging area and loads it into the target warehouse database on the target system server.

This process comprises a series of actions, such as moving the data and loading data into tables. There may also be some processing of objects after the load, often referred to as postload processing.

Moving and Loading Data

Moving and loading the data can be a time-consuming task, depending upon the volumes of data, the hardware, the connectivity setup, and whether parallel operations are in place. The time period within which the warehouse system can perform the load is called the load window.

Loading should be scheduled and prioritized. You should also ensure that the loading is automated as much as possible.

Types of Data Load

There is a single first-time load that moves large volumes of data when the warehouse is implemented. The first-time load is followed by regular refreshes of the warehouse with smaller volumes of data, the grain and frequency of which are determined by the business user requirements.


Extract Processing Environment

• After each time interval, build a new database

• Run queries

[Figure: operational databases, with time intervals T1, T2, and T3]

Warehouse Processing Environment

• Build a new database

• After each time interval, add changes to database

• Archive or purge oldest data

• Run queries

[Figure: operational databases, with time intervals T1, T2, and T3]


Data Refresh Models

First, to ensure that you understand how the warehouse data presentation differs from nonwarehouse data presentation, consider how up-to-date data is presented to users in two different decision support environments: a simple extract processing environment and a data warehouse environment.

Extract Processing Environment

A snapshot of operational data is taken at regular time intervals: T1, T2, and T3. At each interval a new snapshot of the database is created and presented to the user; the old snapshot is purged.

Warehouse Environment

An initial snapshot is taken and the database is loaded with data. At regular time intervals, T1, T2, and T3, a delta database or file is created and the warehouse is refreshed. A delta contains only the changes made to operational data that need to be reflected in the data warehouse.

• The warehouse fact data is refreshed according to the refresh cycle determined by user requirements analysis.

• The warehouse dimension data is updated to reflect the current state of the business, only when changes are detected in the source systems.

• The older snapshot of data is not removed, ensuring that the warehouse contains the historical data needed for analysis.

• The oldest snapshots are archived or purged only when the data is not required any longer.
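The difference between the two refresh models can be sketched in Python. This is a simplified illustration: the row layout, the function names, and the retention rule (keep the most recent `keep_periods` intervals, which must be at least 1) are assumptions, and each delta here contains only new rows rather than updates to existing ones.

```python
def extract_processing_refresh(old_snapshot, new_snapshot):
    """Extract processing: the new snapshot replaces the old one entirely.

    `old_snapshot` is accepted only to make the replacement explicit;
    nothing from it is carried forward.
    """
    return list(new_snapshot)


def warehouse_refresh(warehouse, delta, period, keep_periods):
    """Warehouse: add the delta, then purge periods beyond the retention limit."""
    warehouse = warehouse + [dict(row, period=period) for row in delta]
    periods = sorted({row["period"] for row in warehouse})
    retained = set(periods[-keep_periods:])
    return [row for row in warehouse if row["period"] in retained]


# Changes captured from the operational databases at intervals T1, T2, and T3.
changes = [
    (1, [{"sale_id": 1, "amount": 10.0}]),
    (2, [{"sale_id": 2, "amount": 15.0}]),
    (3, [{"sale_id": 3, "amount": 11.0}]),
]

extract_db, warehouse = [], []
for period, rows in changes:
    extract_db = extract_processing_refresh(extract_db, rows)
    warehouse = warehouse_refresh(warehouse, rows, period, keep_periods=2)
```

After T3, `extract_db` holds only the T3 rows, whereas `warehouse` still holds the T2 and T3 rows; the T1 rows were purged only because the assumed retention limit was reached.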


First-Time Load

• Single event that populates the database with historical data

• Involves large volume of data

• Employs distinct ETT tasks

• Involves large amounts of processing after load

[Figure: operational databases, with time intervals T1, T2, and T3]

Refresh

• Performed according to a business cycle

• Simpler task

• Less data to load than first-time load

• Less-complex ETT

• Smaller amounts of postload processing

[Figure: operational databases, with time intervals T1, T2, and T3]


First-Time Load and Refresh

First-Time Load

The first-time load (sometimes called an initial load) is a single event that occurs prior to implementation. It populates the data warehouse database with as much data as needed or available. The first-time load moves data in the same way as the regular refresh. However, the task is more complex because of:

• Data volumes that may be very large. (For example, if your company decides to load the last five years of data, this may comprise millions of rows, and the load may take days rather than hours.)

• Distinct extraction and transformation tasks that are applicable only to this older data

• The task of populating all fact tables, all dimension tables, and any other ancillary tables you may have created such as reference tables

• Postprocessing of loaded data, with tasks that must work on the large data volumes, such as indexing and key generation

• Postload processing on large volumes of data, such as creating summary tables

With all the issues surrounding the first-time load, it is not a task to be taken lightly. You must plan, prepare, and build recovery capabilities into your processing routines to ensure success.

Refresh

After the first-time load, the refresh is performed on a regular basis according to a cycle determined by users. The cycle may be daily, weekly, monthly, quarterly, or any other business period. The refresh is a simpler task than the first-time load for these reasons:

• There is less fact data to load. You are moving a new snapshot of data, not all of the fact data, into the data warehouse.

• There is normally no full set of dimension data to load (unless your model has changed, which would be an exception), although there may be some dimension data changes to incorporate.

• Less-complex extraction and transformation processes may be needed. Additionally, because these processes are executed regularly, they can be monitored, tested, and improved for each refresh until they run as efficiently as possible.

• Postload processing time is reduced because there is less new data to work with.


Building the Transportation Process

Specification

• Techniques and tools

• File transfer methods

• The load window

• Time window for other tasks

• First-time and refresh volumes

• Frequency of the refresh cycle

• Connectivity bandwidth


Building the Transportation Process

• Test the proposed technique

• Document proposed load

• Gain agreement on the process

• Monitor

• Review

• Revise


Building the Transportation Process

Specifying the Process

You need to identify early in the development process how you are going to move the data from the source systems into the data warehouse. You must identify:

• The data movement techniques and tools available

• File transfer methods and transfer models available

• The time available to load the data into the warehouse—the load window

• Whether the time window is sufficient for other tasks, such as backup, preventive maintenance, and recovery, given expected performance metrics

• The volumes of data involved in the first time load and subsequent refreshes

• The frequency of the refresh cycle and the grain of the data

• Connectivity bandwidth

Testing the Process

You should test the proposed technique to ensure that the data volumes can be physically moved within the constraints of the load window and the network capabilities.

Documenting the Process

You must document the proposed load and communicate it to the operations organization to ensure their agreement and commitment to this important process.

Monitoring, Reviewing, and Revising the Process

You should ensure that the load is constantly monitored and reviewed, and revise metrics where needed. Warehouse data volumes grow rapidly, and metrics for load and data granularity need regular revision.
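As a rough illustration of the load-window check described above, the following Python sketch estimates how long a load would take at a measured throughput and whether it still leaves room for the other tasks in the window. The row counts, throughput, and window length are hypothetical figures, not values from this course; substitute the numbers you measure in your own tests.

```python
def load_hours(rows: int, rows_per_second: float) -> float:
    """Estimated elapsed hours to move `rows` at a sustained throughput."""
    return rows / rows_per_second / 3600


def fits_window(rows: int, rows_per_second: float,
                window_hours: float, other_tasks_hours: float = 0.0) -> bool:
    """True if the load plus other tasks (backup, maintenance) fit the window."""
    return load_hours(rows, rows_per_second) + other_tasks_hours <= window_hours


# Hypothetical figures for a nightly refresh.
refresh_rows = 5_000_000
measured_rate = 2_000  # rows per second observed in a test run (assumed)

print(f"estimated load time: {load_hours(refresh_rows, measured_rate):.2f} hours")
print("fits a 4-hour window with 3 hours of other tasks:",
      fits_window(refresh_rows, measured_rate, window_hours=4, other_tasks_hours=3))
```

Because warehouse volumes grow, it may be worth rerunning a check like this with current figures each time the metrics are reviewed.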


Granularity

• Important design and operational issue

• Space requirements

– Storage

– Backup

– Recovery

– Partitioning

– Load

• Low-level grain

– Expensive, high level of processing, more disk, detail

• High-level grain

– Cheaper, less processing, less disk, little detail


Granularity

You have seen that the grain of the data is important in the warehouse environment.

The lower the level of granularity, the more data is loaded, and this affects the amount of time taken to load the data into the warehouse.

Low-Level Grain

Low-level grain data can be expensive to build and maintain. It requires a large amount of processing power to process the details and provide answers to business queries. It takes up more disk space and could create response time problems. However, the detail provides the low-level information needed for sophisticated business analysis.

High-Level Grain

High-level grain data is easier to build and maintain than low-level grain data. It requires less processing power and disk space, allows a higher number of concurrent users to access data, and performs well. However, the lack of detail and drill-down capability hinders definitive answers to business questions.

Note: The level of granularity affects not only the amount of direct access storage device (DASD) space required for warehouse data, but also the amount of space required for backup, recovery, and partitioning.
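To make the effect of grain on volume concrete, here is a small Python sketch that compares an upper bound on fact rows at a daily grain with the same data held at a monthly grain. The product, store, and row-size figures are invented for illustration, and the calculation assumes one fact row per product, store, and period.

```python
def fact_rows(products: int, stores: int, periods: int) -> int:
    """Upper bound on fact rows: one row per product, store, and period."""
    return products * stores * periods


def storage_gb(rows: int, bytes_per_row: int) -> float:
    """Raw storage for the rows, ignoring indexes, backups, and overhead."""
    return rows * bytes_per_row / 1024 ** 3


# Hypothetical dimensions and history length.
products, stores, years = 1_000, 50, 5
daily_rows = fact_rows(products, stores, 365 * years)
monthly_rows = fact_rows(products, stores, 12 * years)

print("daily grain rows:  ", daily_rows)
print("monthly grain rows:", monthly_rows)
print(f"daily grain at 100 bytes per row: {storage_gb(daily_rows, 100):.1f} GB")
```

The same multiplier applies again to the space needed for backup and recovery copies.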


Transportation Techniques

• Tools

• Utilities and 3GL

• Gateways

• Customized copy programs

• Replication

• FTP

• Manual


Transporting the Data

Now that you have seen how to capture the data needed for the refresh, consider how to physically move the data to the warehouse server.

Transportation Techniques

These common techniques are used to transport data into the warehouse:

• Purchased ETT tools

• Proprietary data movement utilities, for example those that use COBOL, C, or Oracle SQL*Loader

The fastest way to load large amounts of data into the warehouse is to use utilities such as SQL*Loader that can access the database directly, use networks efficiently, and run in parallel environments.

• Gateways, which may be vendor-specific or programmable, such as the Oracle Transparent Gateways

• Customized copy programs, which may employ COBOL, C, PL/SQL, and FTP

The following are also used, but to a lesser degree:

• Replication (database)

• File Transfer Protocol (FTP) alone

• Manual shipping of the load medium to the data warehouse site


Transportation Technique Considerations

• Tools are comprehensive but costly.

• Data-movement utilities are fast and powerful.

• Gateways are not always the fastest method:

– Access other databases

– Supply dependent data marts

– Support a distributed environment

– Provide real-time access if needed


Transportation Technique Considerations

Purchased ETT Tools

If your IT group has decided to use a purchased ETT tool, then it becomes the means by which your data is transported, as well as extracted and transformed. This is not the most common option, particularly for early implementations. Often, because of the cost of such tools, copy utilities are the logical alternative.

Data-Movement Utilities

Oracle implementations use SQL*Loader, which can execute in parallel environments, run in a mode where server intervention is minimized, and perform limited transformations, such as merging rows and changing data types. SQL*Loader can load very large volumes of data in a relatively short time, and you can use it successfully for both the first-time load and refreshes.

Gateways

A gateway is a middleware component that presents a unified view of data coming from different data sources. Of note are Oracle Transparent Gateways (or Procedural Gateways) and Open Database Connectivity (ODBC) tools, which present a uniform view of a database other than an Oracle database, or of a file on specific file systems. Oracle gateways are a mixture: some are read-only, while others are read-write.

Access to Another Vendor’s Database

You should consider using gateway technology in specific instances only, and not on a regular basis. For example, gateway technology allows you to access a non-Oracle database directly, without executing the usual extract programs. If the access is a simple SQL SELECT that retrieves data to be processed for the warehouse, this is faster than building a specific extract for the task.

Develop a Distributed Environment

Gateway technology also gives you the ability to develop warehouses in distributed environments, employing technologies (hardware and software) that are not Oracle-specific.

Real-Time Data Access

It is rare, but some data warehouse implementations are updated in real time. In this situation gateway technology is useful because of the ease of executing remote queries. Consider using gateway technology for this purpose only if it is specifically requested and you can justify it.


Using SQL*Loader to Load Data

[Figure: SQL*Loader with its control file, input files, log files, bad files, and discard files]

• Fastest load mechanism

• Direct path

• Parallel and unrecoverable

• Direct-load INSERT (Oracle8)

• Direct-path load API (Oracle8i)


Using SQL*Loader to Load Data

The fastest way to load data is to use SQL*Loader with the direct path, parallel, and unrecoverable options.

Direct Path Load

Direct path load is optimized for maximum data loading capability. Instead of filling a bind array buffer and creating INSERT commands, a direct path load creates data blocks in Oracle database block format. The blocks are then written directly to the database. The load still makes calls to Oracle, but they are quick and are handled at the start and end of the load process. Unless you load in parallel, only one direct path load can occur on a table at any one time.

Direct-Path Load in Parallel

You can run direct path loads in parallel by using the PARALLEL parameter. Parallel loading can load massive amounts of data in short time frames. Note that conventional path load can also perform parallel loads on the same table, just like any other program or utility that uses SQL INSERT statements.

Direct-Path Load in Parallel and Unrecoverable

To avoid bottlenecks on redo logs, switch on the UNRECOVERABLE option of SQL*Loader. There is no need to write changes to redo logs in this environment.
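The three options above can be seen together in a minimal sketch. The Python below only builds the text of a SQL*Loader control file and the matching sqlldr command line; it does not run a load. The table, column, and file names are hypothetical, the connect credentials are deliberately left out, and the keywords shown (UNRECOVERABLE, APPEND, direct=true, parallel=true) reflect the SQL*Loader documentation as understood here, so check them against the release you use.

```python
from typing import List


def control_file(table: str, datafile: str, columns: List[str],
                 unrecoverable: bool = True) -> str:
    """Return the text of a comma-delimited SQL*Loader control file."""
    lines = []
    if unrecoverable:
        lines.append("UNRECOVERABLE")      # skip redo logging (direct path only)
    lines += [
        "LOAD DATA",
        f"INFILE '{datafile}'",
        "APPEND",                          # add rows to the existing table
        f"INTO TABLE {table}",
        "FIELDS TERMINATED BY ','",
        "(" + ", ".join(columns) + ")",
    ]
    return "\n".join(lines) + "\n"


def sqlldr_args(control_path: str, log_path: str, parallel: bool = True) -> List[str]:
    """Return the sqlldr command line for a direct path load."""
    args = ["sqlldr", f"control={control_path}", f"log={log_path}", "direct=true"]
    if parallel:
        args.append("parallel=true")
    return args


# Hypothetical table and file names, for illustration only.
print(control_file("sales_fact", "sales_1999_05.dat",
                   ["sale_id", "customer_key", "amount"]))
print(" ".join(sqlldr_args("sales.ctl", "sales.log")))
```

For a parallel load you would typically split the input into several files and start one such session per file against the same table.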

Direct-Load INSERT

In Oracle8, direct-load INSERT enhances performance during insert operations by formatting and writing data directly into Oracle data files without using the buffer cache. It has these benefits over direct path load:

• A failure in a single parallel load stream does not stop the whole process.

• The data is already in Oracle format, so the load does not have to convert data.

• It does not log redo information and can work in parallel.

Direct-Path Load API

Oracle8i provides an application programming interface (API) to the direct-path load mechanism in the Oracle Server. This API is described in the next topic.


Direct-Path Load API in Oracle8i

• Allows ETT and other tools to load Oracle databases efficiently

• Permits load behavior to be customized

• Gives direct-path load performance

• Provides complete access to all direct-load functionality using OCI

[Figure: load utility]


Using Direct-Path Load API in Oracle8i

Oracle8i provides an application programming interface to the direct path load mechanism in the Oracle server. This provides a way for independent software vendors and system management tool partners to create easy-to-use and high-performance customized data-loading tools. Access to all load functionality is available through the API. Performance of any third-party data loading tool can therefore be comparable to SQL*Loader.


More Transportation Technique Considerations

• Use customized programs as a last resort

• Replication is limited by data-transfer rates


More Transportation Technique Considerations

Customized Programs

If you are employing Oracle for your warehousing environment, SQL*Loader is recommended. Use customized programs only as a last resort.

Replication

Replication is rarely used in a data warehouse environment, because of the limitations of data-transfer rates. It is normal to use SQL*Loader or in-house-developed loading techniques. If replication is used, it is more likely to be used to feed data marts from a larger warehouse.

Note: Replication is not recommended for moving large volumes of data.


Postprocessing of Loaded Data

[Figure: data is extracted, transformed, and transported; the loaded data is then postprocessed: create indexes, generate keys, summarize, filter]


Postprocessing of Loaded Data

You have now seen how to extract data to an intermediate file store or staging area, where it is:

• Transformed into acceptable warehouse data

• Transported to the warehouse server

You have also seen how the ETT process is slightly different for:

• First-time load, which requires all data to be loaded once

• Refreshing, which requires only changed data to be loaded

You now need to consider the different tasks that might take place once the data is loaded. Various terms are used for these tasks; this course uses the term postprocessing.

The postprocessing tasks are not fixed; you may or may not have to perform them, depending upon the volumes of data moved, the complexity of transformations, and the transportation mechanism. For example, it is possible to load data using SQL*Loader in a manner that excludes database trigger processing. However, at the warehouse server you still want to ensure that the triggers are executed so that the integrity and validity of the data are retained. Doing this after the load is an example of postprocessing.

Four categories of postprocessing are explored on the following pages:

• Creating indexes

• Creating keys

• Creating summary tables

• Filtering


Indexing Data

• Before load: fast index reenablement

• During load: adds time to load window

• After load: adds time to load window

[Figure: index creation relative to the operational databases, staging file, and warehouse database]


Unique Indexes

• Disable constraints to load

• Enable constraints to create index

[Figure: disable constraints, load data, enable constraints, create index, catch errors, reprocess]


Indexing Data

Before

Indexing of data may occur prior to load. You can index the data values for the warehouse after data cleansing and before transportation and load. You can retrieve the data from a presorted list of values much more rapidly by reading the index than by performing a full-table scan. This makes it easier to reenable indexes at the server level. However, this is not done very often.

During

It is possible to create the indexes at the same time as loading the data, using the usual techniques employed by the server. However, this is a row-by-row approach to index creation, which lengthens the time taken to load the data. In most cases the time taken is too long, and for this reason the next option is preferable.

After

It is common to index after the data has been loaded into the warehouse. This adds time to the load window, but it is much faster than row-by-row processing, and in a parallel environment you can speed up index creation further by building the indexes in parallel.

Unique Indexes

If the index you are creating enforces unique values in key columns through database constraints, then it is usual to load the data with the constraints disabled and to enable them after the load. Enabling the constraints builds the index, which may find duplicate values and fail. Ensure that the action catches the errors so that you can correct the data and reindex.

Using SQL, you can employ the EXCEPTIONS INTO clause to catch errors.
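The load-then-index sequence can be tried on a small scale with the following Python sketch. It uses SQLite from the Python standard library as a stand-in for the warehouse database, so it does not use the Oracle EXCEPTIONS INTO clause; instead it finds the duplicate keys with a GROUP BY query when the unique index fails to build. The table, column, and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Load with no unique index in place, so duplicates are not rejected row by row.
conn.execute("CREATE TABLE customer (customer_key INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Acme Inc."), (2, "Globex"), (2, "Globex (duplicate)"), (3, "Initech")],
)


def build_unique_index(db: sqlite3.Connection) -> list:
    """Create the unique index; if it fails, return the duplicated key values."""
    try:
        db.execute("CREATE UNIQUE INDEX customer_uk ON customer (customer_key)")
        return []
    except sqlite3.DatabaseError:
        rows = db.execute(
            "SELECT customer_key FROM customer "
            "GROUP BY customer_key HAVING COUNT(*) > 1 ORDER BY customer_key"
        ).fetchall()
        return [row[0] for row in rows]


duplicates = build_unique_index(conn)      # the first attempt catches the errors
# Correct the data: keep the first row loaded for each key, then reindex.
conn.execute(
    "DELETE FROM customer WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM customer GROUP BY customer_key)"
)
remaining = build_unique_index(conn)
print("duplicate keys found:", duplicates)
print("duplicates after correction:", remaining)
```

Keeping the first row is only one possible correction; in practice the duplicates would be inspected before any row is discarded.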


Creating Artificial Keys

• Use generalized or derived keys

• Maintain the uniqueness of a row

• Use an administrative process to assign the key

• Concatenate operational key with number:

– Easy to maintain

– Cumbersome keys

– No clean value for retrieval

109908 10990801


Creating Unique Keys for Records

• Assign a number from a list:

– No semantic meaning

– Extract operations must reference table to assign numbers

• Update metadata

• Verdict

109908 1


Creating Artificial Keys

An artificial (generalized or derived) key may be used to guarantee that every row in the table is unique. The warehouse data is likely to be a combination of many transformed records, for which there are no natural data keys to use as unique identifiers.

Concatenate Operational Key with a Number

Your postprocessing program executes the create index commands and allocates the key values, which may be a concatenation of the primary key and a version digit or characters.

For example, if a customer record key value contains six digits, such as 109908, the derived key may be 10990801. The last two digits are a sequential number that is generated automatically.

Advantage

This method is relatively easy to maintain, and the programs needed to manage number allocation are relatively easy to set up.

Disadvantages

• The keys may become long and cumbersome.

• There is no clean key value for retrieval of a record, unless you have another copy of the key. For example, if the operational Customer_Id is 109908 but the warehouse key is now 10990801, then extracting information about that customer from the warehouse using 109908 is impossible unless the old value has been retained in another field, such as:

Customer_key   Customer_id   Customer_Name
10990801       109908        Acme Inc.

Assign a Number from a List

You can also assign the key sequentially from a simple list of numbers. A disadvantage of this method is that the keys have no semantic or intuitive meaning.

Metadata

You must ensure the metadata is updated to register the latest key allocations.

Verdict

The option you choose depends upon the extract methods, the tools available, and the hardware and network capability and availability.
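The two key-assignment methods described above can be sketched in a few lines of Python. The function and class names are hypothetical, and the in-memory dictionary merely stands in for the reference table and metadata that record which key was allocated to which operational key.

```python
from itertools import count


def concatenated_key(operational_key: str, version: int) -> str:
    """Operational key followed by a two-digit sequential number."""
    if not 0 <= version <= 99:
        raise ValueError("version must fit in two digits")
    return f"{operational_key}{version:02d}"


class KeyAllocator:
    """Assigns the next number from a list and remembers every assignment."""

    def __init__(self, start: int = 1) -> None:
        self._numbers = count(start)
        self.assigned = {}                 # operational key -> warehouse key

    def key_for(self, operational_key: str) -> int:
        if operational_key not in self.assigned:
            self.assigned[operational_key] = next(self._numbers)
        return self.assigned[operational_key]


print(concatenated_key("109908", 1))       # the derived key from the example
allocator = KeyAllocator()
print(allocator.key_for("109908"), allocator.key_for("109909"),
      allocator.key_for("109908"))
```

With the second method the operational key can only be recovered through the lookup, which is why the extract operations must reference the allocation table.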


Creating Summary Tables

• CTAS

• pCTAS

[Figure: summary data]


Filtering Data

From warehouse to data marts

• CTAS

• pCTAS

[Figure: summary data filtered from the warehouse into data marts]


Creating Summary Tables

This course has already discussed why summary tables are useful to the data warehouse.

• They provide immediate answers to queries, which improves query performance.

• They save disk space. You can create summary data for old history for which detailed analysis is not required.

After you perform the initial user requirements analysis, you determine the summaries needed by users. However, you must also monitor access continually; the access patterns may reveal new summaries that should be created and existing summaries that are no longer needed.

You can create summaries by using:

• CREATE TABLE AS SELECT (CTAS), or

• CREATE TABLE AS SELECT... PARALLEL (pCTAS)

Filtering Data

You may filter out specific information to supply subject-specific data for dependent data marts. The filtering uses simple SQL to create new objects using existing objects. The new objects are then moved into the data mart, similar to the way data is moved into the warehouse.

You can perform this filtering task using:

• CREATE TABLE AS SELECT (CTAS), or

• CREATE TABLE AS SELECT... PARALLEL (pCTAS)
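Both the summary and the filtering tasks above rely on CREATE TABLE AS SELECT. The sketch below runs one statement of each kind against SQLite from the Python standard library, used here only as a convenient stand-in for the warehouse database. The table and column names are hypothetical, and SQLite has no PARALLEL clause, so the pCTAS variant is not shown.

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute(
    "CREATE TABLE sales_fact "
    "(sale_month TEXT, region TEXT, product_id INTEGER, amount REAL)"
)
wh.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("1999-01", "EAST", 1, 10.0),
        ("1999-01", "EAST", 1, 15.0),
        ("1999-01", "WEST", 2, 7.5),
        ("1999-02", "EAST", 1, 20.0),
    ],
)

# Summary table: one row per month and product instead of one row per sale.
wh.execute("""
    CREATE TABLE sales_by_month AS
    SELECT sale_month, product_id,
           COUNT(*) AS sale_count, SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY sale_month, product_id
""")

# Filtered table: the subject-specific subset destined for a data mart.
wh.execute("""
    CREATE TABLE east_sales_mart AS
    SELECT sale_month, product_id, amount
    FROM sales_fact
    WHERE region = 'EAST'
""")

summary_rows = wh.execute(
    "SELECT sale_month, product_id, sale_count, total_amount "
    "FROM sales_by_month ORDER BY sale_month, product_id"
).fetchall()
mart_count = wh.execute("SELECT COUNT(*) FROM east_sales_mart").fetchone()[0]
print(summary_rows)
print("rows in the data mart table:", mart_count)
```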


Verifying Data Integrity

• Load data into intermediate file

• Compare target flash totals with totals before load

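[Figure: File 1 and File 2 are loaded into an intermediate file, and its counts and amounts are compared with the flash totals. If they are equal, the data is loaded into the warehouse; if they are not equal, the files are preserved, inspected, and fixed before loading.]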


Verifying Data Integrity

It is important at all stages of ETT that errors be detected, flagged, and resolved. How you verify data integrity depends upon whether you have a customized approach to ETT or whether you employ an ETT tool. A tool will probably deal with these issues automatically and may only let you see the data once it is available in the warehouse.

It is important to ensure that each load, whether first-time or a refresh, executes successfully. You need to create jobs that track:

• The status of the warehouse load: whether it has started, is in progress, or is complete

• When the process completes

• Statistics to show load start and complete time, and records processed in order to monitor and ensure continuing efficiency

• Comparison of load control counts and amounts:

– You must be aware of the amounts of data that are to be loaded, so that you can perform an accurate validation of completeness.

– You can load the detail and summary records into intermediate files, to compare counts and amounts created before loading with counts and amounts (flash totals) derived on the target data warehouse.

• Data reconciliation issues

• Referential integrity violations

• Any failures that require reprocessing
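The comparison of load control counts and amounts in the list above can be sketched as follows. This is a minimal Python illustration, not a prescribed job design: the class and function names are hypothetical, and amounts are held as integer cents on the assumption that exact comparison is wanted.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ControlTotals:
    record_count: int
    amount_cents: int      # integer cents avoid floating-point rounding


def control_totals(amounts_cents: Iterable[int]) -> ControlTotals:
    """Counts and amounts for one set of records."""
    values = list(amounts_cents)
    return ControlTotals(len(values), sum(values))


def reconcile(before: ControlTotals, after: ControlTotals) -> List[str]:
    """Return a list of discrepancies; an empty list means the totals match."""
    problems = []
    if before.record_count != after.record_count:
        problems.append(
            f"record count: expected {before.record_count}, found {after.record_count}")
    if before.amount_cents != after.amount_cents:
        problems.append(
            f"amount: expected {before.amount_cents}, found {after.amount_cents}")
    return problems


# Hypothetical amounts: the staged file versus what arrived on the target.
staged = [1250, 990, 4300]
loaded = [1250, 990]
for problem in reconcile(control_totals(staged), control_totals(loaded)):
    print("MISMATCH", problem)
```

A tracking job could store the returned discrepancies alongside the load start and completion times.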


Steps for Verifying Data Integrity

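[Figure: source files feed the control and extract process, whose output is loaded by SQL*Loader, producing .log and .bad files; the numbers 1 to 7 mark the steps described in the example]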


Steps for Verifying Data IntegrityYou may find it useful to load the detail and summary records into intermediate files, so that you can compare record counts and sample totals before loading on the target data warehouse. If the counts and totals do not match, you must preserve and inspect the intermediate files without loading and compromising data warehouse data integrity.

Example In the diagram, you see that the source data is coming in from a number of files.

1 The control and extract process queries and downloads the data, and appends a row (either a row count value or a phony row of unique data).

2 The process generates a report indicating the data extraction information, such as the number of rows downloaded, the number of bytes in the file, and the query statement.

3 The process puts the extracted data into a flat file.

4 SQL*Loader loads the data into a database table.

5 The conversion and loading process generates a loader log to track the same type of information as the extract report: the number of rows downloaded, the number of bytes contained in the file, and conversion details.

6 At the end of the load process, the SQL*Loader script removes the last record of the flat file and puts it into a filename.bad file, which contains the row count or phony record of data that was added by the extraction process.

7 A UNIX script compares the mainframe report and the loader log to see if they contain the same information. The script may also look at the .bad file to determine if the correct last row of data was removed from the loading process. If the reports match and the data in the .bad file is correct, then the loading process is deemed successful.

If you are writing a custom mechanism, embed a set of rows into the data so that verification is easier. You can query for the embedded data to see that all rows are loaded. Your routine may also display messages, which are embedded in the load routine, or send an e-mail.
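The verification steps above can be sketched in Python. This is a minimal illustration, not the course's actual scripts: the pipe-delimited layout, the `TRAILER` marker, and the function names are assumptions made for the example, and plain lists stand in for the flat file, the target table, and the .bad file.

```python
def build_extract(rows):
    """Steps 1-3: return flat-file lines plus an appended row-count trailer."""
    lines = ["|".join(str(value) for value in row) for row in rows]
    lines.append(f"TRAILER|{len(rows)}")  # hypothetical trailer layout
    return lines


def load_extract(lines):
    """Steps 4-6: 'load' the data lines; divert the trailer to a bad list."""
    loaded, bad = [], []
    for line in lines:
        if line.startswith("TRAILER|"):
            bad.append(line)              # plays the role of filename.bad
        else:
            loaded.append(line.split("|"))
    return loaded, bad


def verify_load(extract_row_count, loaded, bad):
    """Step 7: compare the extract report, the load result, and the trailer."""
    if len(bad) != 1:
        return False
    trailer_count = int(bad[0].split("|")[1])
    return extract_row_count == len(loaded) == trailer_count


rows = [(1, "widget", 9.5), (2, "gadget", 12.0)]
lines = build_extract(rows)
loaded, bad = load_extract(lines)
ok = verify_load(len(rows), loaded, bad)
```

The load is accepted only when the count in the extract report, the number of rows loaded, and the count carried by the trailer row all agree.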


Standard Quality Assurance Checks

• Load status

• Completion of the process

• Completeness of the data

• Data reconciliation

• Violations

• Reprocessing

• Comparison of counts and amounts


Standard Quality Assurance Checks

The following tasks are standard quality assurance checks for the data loaded into the warehouse:

• Status of the warehouse load

• Completion of the load process

• Completeness of the data

• Data reconciliation

• Referential integrity violations and reprocessing

• Comparison of load control counts and amounts
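Two of these checks, referential integrity violations and the comparison of load control counts and amounts, lend themselves to a short sketch. The example uses Python's sqlite3 module as a stand-in for the warehouse database; the table names, column names, and control totals are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_fact (sale_id INTEGER, product_id INTEGER, amount REAL);
    INSERT INTO product_dim VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO sales_fact VALUES (10, 1, 5.0), (11, 2, 7.5), (12, 99, 1.0);
""")

# Referential integrity: fact rows whose product has no dimension row.
orphans = conn.execute("""
    SELECT f.sale_id
    FROM sales_fact f
    LEFT JOIN product_dim d ON d.product_id = f.product_id
    WHERE d.product_id IS NULL
""").fetchall()

# Load control: compare the loaded count and amount with the control values.
control = {"row_count": 3, "total_amount": 13.5}  # assumed control totals
row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales_fact"
).fetchone()
counts_match = (
    row_count == control["row_count"]
    and abs(total_amount - control["total_amount"]) < 1e-9
)
```

Here the counts and amounts reconcile, but sale 12 refers to a product that is missing from the dimension table, so it would be set aside for reprocessing.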


Summary

This lesson discussed the following topics:

• First-time load considerations

• Techniques for transporting data

• Tasks involved in the postload processing stage


Summary

This lesson discussed the following topics:

• Tasks involved with first-time loading of data into the warehouse

• Techniques for transporting data

• Tasks involved in the postload processing stage


Practice 12-1 Overview

This practice covers the following topics:

• Identifying a series of statements as true or false

• Answering a series of questions


Practice 12-1

1 Assemble into small groups of 3 or 4. Discuss and compare the factors that will determine the load window where you work. Consider user requirements, operational constraints, and staffing issues.

2 Identify whether the following statements are true or false.

Question | True | False
Transportation of data involves moving the data into the data warehouse database. | ____ | ____
An example of high level grain data is summarized data. | ____ | ____
SQL*Loader is the fastest way to move data into the data warehouse database. | ____ | ____
Gateways are useful for moving large amounts of data into the warehouse. | ____ | ____
Data for the data warehouse is always indexed after it is loaded. | ____ | ____
The quickest way to create unique indexes on warehouse data is to leave database constraints enabled on load. | ____ | ____
Summary tables are created on the warehouse server. | ____ | ____
Filtering removes unwanted records from staging files. | ____ | ____

3 Name the two different types of data loading.

_____________________

_____________________

4 Name four methods of moving data to the warehouse server.

_____________________

_____________________

_____________________

_____________________

5 What SQL command is used to create summary tables on the data warehouse server?

________________________________________________________________

6 What server technique can be used to prevent and allow access to data in the warehouse after refresh?

________________________________________________________________


Lesson 13

Transportation: Refreshing Warehouse Data


Overview

[Course roadmap diagram: Project Management (Methodology, Maintaining Metadata); Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Choosing a Computing Architecture; Modeling the Data Warehouse; Planning Warehouse Storage; ETT (Building the Warehouse); Meeting a Business Need; Supporting End User Access; Managing the Data Warehouse]


Overview

In the last lesson, you examined the first-time load of the warehouse. In this lesson, you examine methods for updating the warehouse with changed data after the first-time load.

Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide on the facing page.

Objectives

After completing this lesson, you should be able to do the following:

• Describe methods for capturing changed data

• Explain techniques for applying the changes

• Discuss techniques for purging and archiving data

• Outline final tasks, such as publishing the data, controlling access, and automating processes

• List tools for transporting data into the warehouse


Developing a Refresh Strategy for Capturing Changed Data

• Consider load window

• Identify data volumes

• Identify cycle

• Know the technical infrastructure

• Plan a staging area

• Determine how to detect changes


User Requirements and Assistance

• Users define the refresh cycle

• IT balances requirements against technical issues

• Document all tasks and processes

• Employ user skills


Capturing Changed Data

You must have a strategy for maintaining changes to the data warehouse, including changes to facts, dimension data, and summary data.

There are no concrete rules about when the data warehouse should be refreshed, but there are several factors to consider:

• Total load window available

• The volume of data to be transferred

• How often does the warehouse data need to be updated? When are you going to move the data? Will you refresh monthly, weekly, or at another time interval? Will you use continuous refresh for nearly real-time data?

• The connectivity available for moving the data into the data warehouse. How are you going to move the data? Will you move it in batch mode, which is feasible for less time-critical applications?

• Are you going to move data from operational systems to an intermediate area? Is this area an operational data store? Is it a flat file? Is it an Oracle database? Or is it something completely unique to your implementation?

• How are changes in data to be detected? Are you going to push the changes through when detected? Are you going to pull the changes in? Where are you going to store the changes? Could you use triggers to force changes into an alternative store?

User Requirements and Assistance

The strategy is primarily defined by user requirements, but these must be balanced against the available technology and the windows available for loads. All tasks and processes must be documented and understood by everyone involved in the project. The users can also provide expertise for load verification, validation, and run-to-run and load controls.


Load Window

• Time available for entire ETT process

• Plan

• Test

• Prove

• Monitor

[Timeline, 0 to 24 hours: user access period and two load windows]

Load Window

• Plan and build processes according to a strategy.

• Consider volumes of data.

• Identify technical infrastructure.

• Ensure currency of data.

• Consider user access requirements first.

• High availability requirements may mean a small load window.

Load Window

The load window is simply the amount of time you have available to extract, transform, load, and postprocess the data, and to make the data warehouse available to the users. The load consists of many sequential tasks, each of which takes time to execute.

You must ensure that every event that occurs during the load window is planned, tested, proven, and constantly monitored. Poor load performance extends the load time and prevents users from accessing the data when they need it. Careful planning, defining, testing, and scheduling are critical.

Load Window Strategy The load time is dependent upon a number of factors, such as data volumes, network capacity, and load utility capabilities. You must not forget that the aim is to ensure the currency of data for the users, who require access to the data for analysis. To work out an effective load window strategy, consider the user requirements first, and then work out the load schedule backward from that point.

Determining the Load Window It is usual to define the user access requirements first and work the load schedule backward from that point. Once the user access time is defined, you can establish the load cycles. Some of the processes overlap to enable all processes to run within the window.

In practice, users often require almost twenty-four-hour access, so the load window is significantly smaller than in the example shown here. In that event, you need to consider how to process the load while still presenting users with current, realistic data. This is where you can use partitioning strategies.
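Working the load schedule backward from the user access time can be sketched as a small calculation. The task names and durations here are hypothetical, and the sketch assumes strictly sequential tasks, whereas a real schedule may overlap some processes to fit the window.

```python
from datetime import datetime, timedelta


def schedule_backward(access_start, task_minutes):
    """Return (task, start, end) tuples so the last task ends at access_start."""
    schedule = []
    end = access_start
    for task, minutes in reversed(task_minutes):
        start = end - timedelta(minutes=minutes)
        schedule.append((task, start, end))
        end = start
    schedule.reverse()
    return schedule


# Hypothetical durations, in minutes, for one refresh cycle.
tasks = [("extract", 60), ("transform", 90), ("load", 120), ("postprocess", 60)]
access_start = datetime(1999, 5, 3, 9, 0)   # users need the data at 9 a.m.
plan = schedule_backward(access_start, tasks)
cycle_must_start_by = plan[0][1]
```

With these figures the tasks total five and a half hours, so the cycle must start no later than 3:30 a.m. for the data to be available at 9 a.m.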


Scheduling the Load Window

[Slide diagram, 0 to 3 am: requirements; load cycle; control file; control process receives File 1 and File 2 by FTP, then opens and reads the files to verify and analyze. Control file contents: file names, file types, number of files, number of loads, first-time load or refresh, date of file, date range, records in file (counts), totals (amounts).]

Scheduling the Load Window

From the example you can see that the transportation of data (that is, moving the data to the server and loading into the warehouse tables) is a complex task involving many steps.

To work out an effective load window strategy, consider the user requirements first, and then work out the load schedule backward from that point.

Example of Scheduling the Load Window

1 Determine when the users require the data. If the working hours are between 9 a.m. and 5 p.m., you allow them access during that period.

2 Once the user data-access time is defined, you can establish the load cycle. The load cycle may need to access different extract files, or a different number of extract files, each time the load is performed. You may need to split the cycle into a series of loads using one file at a time.

3 You create a control file to manage every load, or series of loads. Remember that the first-time load is different from refreshes, and that for each refresh the files and number of files may differ.

The control file contains information such as the:

– File name and type

– Date of the file

– Number of records in the file

– Date range for the data in the file

– Counts of records and totals so that the data load can be verified

4 The control process is an active process that waits for the files named in the control file to be received. The number and names of these files vary among loads. Files are usually transferred using File Transfer Protocol (FTP) techniques. The control process does not pass control to any other process until all files are received and it has opened them and read the count and amount data to be used for load verification and analysis.

Note: The time 0 identified on the slides denotes 00:00 Zulu, which is midnight.
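The check that the control process performs before handing over to the load can be sketched as follows. The control-file fields, the CSV layout with the amount in the second column, and the file names are assumptions for illustration. A real control process would also wait or poll until the files arrive, whereas this sketch checks once.

```python
import csv
import os
import tempfile


def check_files(control_entries, directory):
    """Return a list of problems; an empty list means the load may proceed."""
    problems = []
    for entry in control_entries:
        path = os.path.join(directory, entry["file_name"])
        if not os.path.exists(path):
            problems.append(f"{entry['file_name']}: not received")
            continue
        with open(path, newline="") as handle:
            rows = list(csv.reader(handle))
        total = sum(float(row[1]) for row in rows)  # amount in column 2
        if len(rows) != entry["record_count"]:
            problems.append(f"{entry['file_name']}: count mismatch")
        elif abs(total - entry["total_amount"]) > 1e-9:
            problems.append(f"{entry['file_name']}: amount mismatch")
    return problems


# Demonstration with one received file and one file that has not arrived.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "file1.csv"), "w", newline="") as handle:
    csv.writer(handle).writerows([["a", "10.0"], ["b", "2.5"]])

control_entries = [
    {"file_name": "file1.csv", "record_count": 2, "total_amount": 12.5},
    {"file_name": "file2.csv", "record_count": 5, "total_amount": 40.0},
]
problems = check_files(control_entries, workdir)
```

Because the second file has not been received, the list of problems is not empty and the load would not start.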


Scheduling the Load Window

[Slide diagrams, 3 am to 9 am: parallel load of File 1 and File 2 into the warehouse; verify, analyze, reapply; index data; create summaries; update metadata; back up warehouse; create views for specialized tools; users access summary data; publish; user access. The steps are numbered 5 to 13.]

Example of Scheduling the Load Window (continued)

5 The data is then loaded into the warehouse.

6 Each load requires verification and analysis (and maybe reanalysis once any load exceptions are reapplied). You need to ensure that the data is successfully loaded by performing checks against the row counts and amounts available in the control files.

Any loading errors yielding potentially bad data need to be reapplied. This adds time to the load, and contingency should be built into the cycle to cope with this. If you are using SQL*Loader to move the data, the bad data resides in a file called <filename>.bad.

7 Indexes are constructed.

8 Summarization takes place.

9 Metadata is updated to ensure it contains information about the current load.

10 The warehouse is backed up. With many database servers today, there are typically two mechanisms for backup: hot, with users online, and cold, with users offline. You should consider cold backups before user access. The backup should include:

– All warehouse data

– Summary tables

– Database schema

– Metadata

Note: If the information is supplied to the warehouse on tape, a full cold backup may not be necessary. The summaries created at the target server may be all that you need to back up.

11 Create the views required by specialized user tools, such as Oracle Express RAM/RAA.

12 Give users access to the summary data.

13 Publish information to the users, specifying the changes to the data warehouse and allowing them access.

Note: These steps identify one solution and assume that summarization and indexing occur after load, and that the job is executed from a batch file.
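Because the steps run in a fixed order from a batch job, a failure in one step should stop the steps that depend on it. The following sketch shows one way to express that; the step names are abbreviations of the steps above, and the lambdas are placeholders for real work.

```python
def run_load_cycle(steps):
    """Run (name, callable) steps in order; stop at the first failure."""
    completed = []
    for name, step in steps:
        if not step():
            return completed, name   # name of the step that failed
        completed.append(name)
    return completed, None


steps = [
    ("load", lambda: True),
    ("verify", lambda: True),
    ("index", lambda: True),
    ("summarize", lambda: False),    # simulated failure
    ("update_metadata", lambda: True),
    ("backup", lambda: True),
    ("publish", lambda: True),
]
completed, failed = run_load_cycle(steps)
```

In this run the cycle stops at summarization, so nothing is published to the users until the problem is corrected and the cycle is rerun.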


Capturing Changed Data for Refresh

• Capture new fact data

• Capture changed dimension data

• Determine method for capture of each

• Methods:

– Wholesale data replacement

– Comparison of database instances

– Time stamping

– Database triggers

– Database log

• Hybrid techniques


Capturing Changed Data for Refresh

There are two major categories of changed data:

• New fact data

• Changed dimension data

For each, a different capture mechanism will be discussed.

In addition, consider how you will process the load. New fact data might easily be loaded by adding another partition of data, a relatively straightforward process (for a database administrator).

Changes to dimension data need more selective update. You need to evaluate whether the change is to replace or add to an existing record, or whether you want to maintain history (keeping old and new records).

• For example, the description of a product may change over its lifetime, even if its primary (and unique) part number remains the same. It is important to see that change reflected.

• Another common example is sales districts in a sales organization that reorganizes.

Methods

There are a number of ways to capture changes to data. Consider which is the most efficient for your individual circumstances:

• Wholesale data replacement

• Comparison of database instances

• Time and date stamping

• Database triggers

• Database log

Note: All methods identified here are possible with Oracle server and associated facilities and utilities.


Wholesale Data Replacement

• Expensive

• Limited historical data, if any

• Data mart implementations

• Time period replacement

Comparison of Database Instances

• Simple to perform, but expensive in time and processing

• Delta file:

– Changes to operational data since last refresh

– Used by various techniques

[Slide diagram: yesterday's operational database and today's operational database feed a database comparison; a delta file holds the changed data]

Wholesale Data Replacement

This method refreshes the entire warehouse in every business cycle. This method is understandably very expensive. Every refresh needs to extract, transform, and transport the entire warehouse. In fact, this method is similar to using a first-time load on a regular basis.

Some data mart and online analytical processing server implementations use this method because they hold less data (a subset of the data warehouse), and wholesale replacement is less complex and less expensive than programming mirroring and update procedures.

Issues The time window required for wholesale replacement can often exceed the time that the data is contracted to be offline (and unavailable to the users). However, with mirroring strategies users can be directed to an image copy of the data warehouse while maintenance is being performed. The changes that occur during the maintenance cycle must be applied to the current online image (production version). The production version should then be backed up or mirrored.

Historical data analysis is limited, because you are restricted by the sheer volume of data loaded each time.
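As a minimal sketch of wholesale replacement, the following example empties and reloads a small table on each cycle. It uses Python's sqlite3 module as a stand-in for a data mart; the `sales_mart` table and its rows are hypothetical.

```python
import sqlite3


def replace_wholesale(conn, rows):
    """Delete every row and reload the table inside one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM sales_mart")
        conn.executemany("INSERT INTO sales_mart VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_mart (region TEXT, amount REAL)")

replace_wholesale(conn, [("north", 10.0), ("south", 20.0)])       # cycle 1
replace_wholesale(conn, [("north", 11.0), ("south", 19.0),
                         ("east", 5.0)])                           # cycle 2

current = conn.execute(
    "SELECT region, amount FROM sales_mart ORDER BY region"
).fetchall()
```

After the second cycle only the newest rows remain, which illustrates why this method keeps little or no history.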

Comparison of Database Instances

In this method, you capture the differences between two instances of the same database to find the changes that have occurred since the data warehouse was last refreshed. The changes are held in an intermediate (or delta) file and are used to update the warehouse.

Issues It is a simple but expensive way to determine changes. As with wholesale replacement, it works most efficiently and effectively when the volumes of data are small.

Delta File or Database The delta database (or file) contains only the changes that have been made to the operational system since the last refresh. An operational application may need to be modified to create the delta file structure and contain the new logic that captures changes and adds the rows to the delta file.
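The comparison of two instances can be sketched with two in-memory snapshots keyed by primary key. The snapshot contents and the shape of the delta are assumptions for illustration; a real comparison would read both database instances.

```python
def compare_snapshots(yesterday, today):
    """Return the delta between two snapshots keyed by primary key."""
    delta = {"insert": {}, "update": {}, "delete": {}}
    for key, row in today.items():
        if key not in yesterday:
            delta["insert"][key] = row
        elif yesterday[key] != row:
            delta["update"][key] = row
    for key, row in yesterday.items():
        if key not in today:
            delta["delete"][key] = row
    return delta


yesterday = {
    1: ("John Doe", "Single"),
    2: ("Ann Lee", "Married"),
    3: ("Bo Chen", "Single"),
}
today = {
    1: ("John Doe", "Married"),
    2: ("Ann Lee", "Married"),
    4: ("Cy Park", "Single"),
}
delta = compare_snapshots(yesterday, today)
```

Every row of both snapshots is read, which is why the method is simple but expensive for large volumes. Unlike time stamping, it does detect deleted rows.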


Time and Date Stamping

• Fast scanning for records changed since last extraction

• Date Updated field

• No detection of deleted data

[Slide diagram: operational data; a delta file holds the changed data]

Database Triggers

• Changed data intercepted at the server level

• Extra I/O required

• Maintenance overhead

[Slide diagram: triggers on the operational server (DBMS) act on the operational data; a delta file holds the changed data]

Time and Date Stamping

A time and date stamp on changed data quickly shows you the data that has been changed since the last refresh cycle. The time and date stamp is normally part of a key value, making it an efficient way to search and find changed data.

The advantage of this approach is that the process that creates the delta database only needs to look at the time key and identify the records with the required time and date stamp. Depending upon the frequency of refresh and the mechanism chosen for time and date stamping, the search condition may be a specific date, for example Time_Key = '01-JAN-97'; a date range, such as Time_Key BETWEEN '01-JAN-97' AND '07-JAN-97'; or a pattern, such as Time_Key LIKE '%JAN-97'.

Issues You can use this method only if the database contains a Date Updated field, which may not be the case in many operational systems. This is one issue that may be resolved by reengineering source system applications or database server code. You might add database triggers to perform the updates.

Note: Time and date stamps do not catch deleted data.
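The following sketch shows both the technique and its limitation. It uses Python's sqlite3 module rather than an Oracle database, and the `customer` table, the `date_updated` column, and the dates are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        status TEXT,
        date_updated TEXT          -- ISO 8601 dates sort correctly as text
    );
    INSERT INTO customer VALUES
        (1, 'Single',  '1997-01-01'),
        (2, 'Married', '1997-01-05'),
        (3, 'Single',  '1997-01-09');
""")

last_refresh = "1997-01-03"

# A deletion leaves no row behind, so no time stamp records it.
conn.execute("DELETE FROM customer WHERE customer_id = 3")

# Select only the rows stamped after the last refresh.
changed = conn.execute(
    "SELECT customer_id, status FROM customer "
    "WHERE date_updated > ? ORDER BY customer_id",
    (last_refresh,),
).fetchall()
```

Only customer 2 is returned. Customer 3 was deleted after the last refresh, but nothing in the result reveals that, so the warehouse would keep the stale row unless deletions are captured another way.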

Database Triggers

Procedural code in database triggers captures and identifies changed data at the database level. Extra I/O is required while the system is online to track changes as they occur and maintain a delta file if needed.

Issues You must modify the database to add server (DBMS) triggers that capture before and after images of the records.

The triggers and associated code—PL/SQL, if using Oracle—write the changes to a delta database or file. Of course, to use this method, the server must support database triggers.
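As an illustration of trigger-based capture, the sketch below uses SQLite triggers through Python's sqlite3 module instead of Oracle PL/SQL. The `customer` and `customer_delta` tables and the trigger names are hypothetical; each trigger writes the before and after images of a changed row to the delta table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE customer_delta (
        operation TEXT, customer_id INTEGER,
        old_status TEXT, new_status TEXT
    );

    -- Capture the before and after images of every update.
    CREATE TRIGGER customer_upd AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_delta
        VALUES ('U', OLD.customer_id, OLD.status, NEW.status);
    END;

    -- Capture deletions, which time stamps alone cannot detect.
    CREATE TRIGGER customer_del AFTER DELETE ON customer
    BEGIN
        INSERT INTO customer_delta
        VALUES ('D', OLD.customer_id, OLD.status, NULL);
    END;

    INSERT INTO customer VALUES (1, 'Single'), (2, 'Married');
    UPDATE customer SET status = 'Married' WHERE customer_id = 1;
    DELETE FROM customer WHERE customer_id = 2;
""")

delta = conn.execute(
    "SELECT operation, customer_id, old_status, new_status "
    "FROM customer_delta ORDER BY rowid"
).fetchall()
```

Each change to the operational table causes an extra write to the delta table, which is the additional I/O mentioned above.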


Using a Database Log

• Contains before and after images

• Requires system checkpoint

• Common technique

[Slide diagram: the log on the operational server (DBMS) goes through log analysis and data extraction; a delta file holds the changed data]

Verdict

• Consider each method on merit.

• Consider a hybrid approach if one approach is not suitable.

• Consider current technical, existing operational, and current application issues.

Using a Database Log

A log file contains information from which you can extract changed data; it logs “before” and “after” images of the data. You may analyze the log file in batch mode to identify the differences that become the delta file.

Issues

• The format of the log file may be difficult to interpret and use.

• The log tape is not really intended for use by the warehouse, and often contains a lot of data not required by the warehouse.

• The system must wait for a checkpoint in order to get a stable log.

This is a process that many ETT tools use, but it can be done only on databases that provide a log, such as Oracle and DB2.

Note: Oracle snapshot and replication facilities log changes into another table.
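Batch analysis of a log can be sketched as collapsing the logged before and after images into one net change per key. The log format here, a list of (key, before, after) tuples, is purely hypothetical; real database logs are vendor specific and considerably harder to interpret.

```python
def net_changes(log_entries):
    """Collapse a log into one (first before, last after) pair per key."""
    net = {}
    for key, before, after in log_entries:
        if key in net:
            net[key] = (net[key][0], after)   # keep the earliest before image
        else:
            net[key] = (before, after)
    # Drop keys whose net effect over the period is no change at all.
    return {key: pair for key, pair in net.items() if pair[0] != pair[1]}


log_entries = [
    (1, "Single", "Married"),
    (2, None, "Single"),         # insert: no before image
    (1, "Married", "Divorced"),
    (3, "Single", None),         # delete: no after image
    (2, "Single", None),         # row 2 inserted and deleted in the period
]
delta = net_changes(log_entries)
```

Row 2 was created and removed between refreshes, so it does not appear in the delta at all, while row 1 appears once with its first before image and its last after image.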

Verdict

Each of these mechanisms has its good and bad points. In reality, your data warehousing environment might use a combination of these mechanisms. For example, you might:

• Time-stamp changed dimension data

• Simply extract the data that exists within a database partition for the new fact data

• Use wholesale replacement to supply your dependent data marts with updated data

The choice you make is based on the many factors identified earlier in this lesson.


Applying the Changes to Data

You have a choice of techniques:

• Overwrite a record

• Add a record

• Add a field

• Maintain history

• Add version numbers


Overwriting a Record

[Slide example: the record (Customer Id, John Doe, Single) is overwritten with (Customer Id, John Doe, Married)]

• Easy to implement

• Loses all history

• Not recommended


Applying the Changes to Data

There are a number of methods for managing changes to existing data in dimension tables:

• Overwrite a record

• Add a new record

• Add a current field

• Maintain history records

• Versioning of records

Overwriting a Record

This method is easy to implement, but it is useful only if you are not interested in maintaining the history of data. If the data you are changing is critical to the context of information and analysis of the business, then overwriting a record is to be avoided at all costs.

For example, by overwriting dimension data, you lose all track of history—you can never see that John Doe was single if the value “Single” is overwritten with the value “Married” from the operational system. The Customer_Id for John Doe remains constant throughout the life of the warehouse, because only one record for John Doe is stored.
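The overwrite behavior described above can be sketched in a few lines. This is a minimal illustration, not part of the course material: SQLite (through Python's sqlite3 module) stands in for the warehouse database, and the table name, column names, and the function overwrite_marital_status are assumptions made for the example.

```python
import sqlite3

def overwrite_marital_status(conn, customer_id, new_status):
    """Overwrite the attribute in place; the previous value is not kept anywhere."""
    conn.execute(
        "UPDATE customer SET marital_status = ? WHERE customer_id = ?",
        (new_status, customer_id),
    )

# Hypothetical one-row dimension table for the John Doe example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, marital_status TEXT)"
)
conn.execute("INSERT INTO customer VALUES (1, 'John Doe', 'Single')")

overwrite_marital_status(conn, 1, "Married")

rows = conn.execute(
    "SELECT customer_id, name, marital_status FROM customer"
).fetchall()
print(rows)  # still one row, same key; the value 'Single' can no longer be queried
```

After the update there is still exactly one row with the same key, which is why the method is simple and also why no query can ever report that the customer was once single.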


Adding a New Record

[Figure: before the change, one record with key 1 (John Doe, Single); after the change, two records: key 1 (John Doe, Single) and key 1A (John Doe, Married).]

• History is preserved; dimensions grow.

• Time constraints are not required.

• Generalized key is created.

• Metadata tracks usage of keys.


Adding a Current Field

[Figure: the record "Customer Id, John Doe, Single" gains a current-value field and an effective date: "Customer Id, John Doe, Single, Married, 01-JAN-96".]

• Maintains some history

• Loses intermediate values

• Is enhanced by adding an Effective Date field


Adding a New Record

Using this method, you add another dimension record for John Doe. One record shows that he was “single” until December 31, 1995, another that he was “married” from January 1, 1996. With this method, history is accurately preserved, but the dimension tables get bigger.

• A generalized (or artificial) key is created for the second John Doe record.

• The generalized key is a derived value that ensures that a record remains unique. However, you now have more than one key to manage.

• You also need to ensure that the record keeps track of when the change occurred.

The Customer_Id for John Doe does not remain constant throughout the life of the warehouse, because each record added for John Doe contains a unique key. The key value is usually a combination of the operational system identifier with characters or digits appended to it.

Consider using real data keys. The example here shows a method that is commonly identified in warehouse reference material.
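As a rough sketch of the add-a-record method, the following keeps every version of John Doe as its own row and derives a generalized key by appending a letter to the operational identifier (1, 1A, ...), as in the slide. SQLite stands in for the warehouse database; the table layout, the function add_customer_record, and the suffix scheme are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_key TEXT PRIMARY KEY,"   # generalized (artificial) key
    "customer_id INTEGER,"             # operational system identifier
    "name TEXT, marital_status TEXT, effective_date TEXT)"
)

def add_customer_record(conn, customer_id, name, status, effective_date):
    """Insert a new row for each change instead of updating the old one."""
    n = conn.execute(
        "SELECT COUNT(*) FROM customer WHERE customer_id = ?", (customer_id,)
    ).fetchone()[0]
    # Assumed scheme: first row keeps the plain id, later rows get A, B, C, ...
    # (this toy scheme only covers 26 changes per customer).
    suffix = "" if n == 0 else chr(ord("A") + n - 1)
    key = f"{customer_id}{suffix}"
    conn.execute(
        "INSERT INTO customer VALUES (?, ?, ?, ?, ?)",
        (key, customer_id, name, status, effective_date),
    )
    return key

first_key = add_customer_record(conn, 1, "John Doe", "Single", None)
second_key = add_customer_record(conn, 1, "John Doe", "Married", "1996-01-01")

history = conn.execute(
    "SELECT customer_key, marital_status FROM customer "
    "WHERE customer_id = 1 ORDER BY customer_key"
).fetchall()
print(history)
```

Keeping the operational identifier in its own column next to the generalized key is one way to address the concern that there is now more than one key to manage.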

Adding a Current Field

In this method, you add a new field to the dimension table to hold the current value of the attribute. Using this method, you can keep some track of history. You know that John Doe was “single” before he was “married”. Each time John’s marital status changes, the two status attributes are updated and a new Effective Date is entered.

However, what you cannot see from this method is what changes have taken place between the two records you are storing for John Doe—intermediate values are lost.

• Consider using an Effective Date attribute to show when the status changed.

• Partitioning of data can then be performed by effective date.

The method you choose is again determined by the business requirements. If you want to maintain history, this method is a logical choice that can be enhanced by using a generalized key.
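The current-field method, and the way it loses intermediate values, can be sketched as follows. SQLite stands in for the warehouse database; the column names previous_status and current_status, the function change_status, and the second change to "Divorced" are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, name TEXT,"
    "previous_status TEXT, current_status TEXT, effective_date TEXT)"
)
conn.execute("INSERT INTO customer VALUES (1, 'John Doe', NULL, 'Single', NULL)")

def change_status(conn, customer_id, new_status, effective_date):
    """Shift the current value into the previous field, then store the new value."""
    # In SQL, the right-hand sides are evaluated against the row as it was
    # before the update, so previous_status receives the old current_status.
    conn.execute(
        "UPDATE customer SET previous_status = current_status, "
        "current_status = ?, effective_date = ? WHERE customer_id = ?",
        (new_status, effective_date, customer_id),
    )

change_status(conn, 1, "Married", "1996-01-01")
after_first = conn.execute("SELECT * FROM customer").fetchone()

# A hypothetical second change: the original value 'Single' is now gone.
change_status(conn, 1, "Divorced", "1999-03-01")
after_second = conn.execute("SELECT * FROM customer").fetchone()

print(after_first)
print(after_second)
```

After the second change the row holds only "Married" and "Divorced", which illustrates the loss of intermediate values noted above.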


Limitations of Methods for Applying Changes

• Complete history impossible

• Dimensions may grow large

• Maintenance overhead

1234    Comer  1 Main Street  555-6789
1234    Comer  200 First Ave  222-3211

1234    Comer  1 Main Street  555-6789
123401  Comer  200 First Ave  222-3211

                                        Effective Date
1234    Comer  1 Main Street  555-6789  01-Apr-93
123401  Comer  200 First Ave  222-3211  01-Jun-97


Limitations of Methods for Applying Changes

Assume a customer record as follows:

Custid  Name   Address        Phone
1234    Comer  1 Main Street  555-6789

If you overwrite the record, history is lost, and there is no record of this company ever existing at 1 Main Street.

Custid  Name   Address        Phone
1234    Comer  200 First Ave  222-3211

You may add a record and create a generalized key to identify the row uniquely. However, this method may make the dimension large and unmanageable and you have lost that customer’s unique identifier.

Custid  Name   Address        Phone
1234    Comer  1 Main Street  555-6789
123401  Comer  200 First Ave  222-3211

You also have to duplicate the fields for this customer that have not changed into the record with the new generated key, which adds to the maintenance burden.

You may add a current field and create a generalized key to uniquely identify the row:

Custid  Name   Address        Phone     Effective Date
123401  Comer  200 First Ave  555-6789  01-Jun-97

In this situation, you know that 200 First Ave. is the current address, but you have no way of knowing the previous address details.


Maintaining History

[Figure: Sales fact table with Product, Time, and CUSTOMER dimensions; one-to-many relationship between CUSTOMER and HIST_CUST.]

• Always retain current record

• Consistently able to refer to record history


Maintaining History

Another alternative is to use history tables, which involve normalizing the dimensions to hold current and historical data. Oracle consultants engaged in data warehouse implementations have found this method to be a more comprehensive, effective, and easily managed solution.

One-to-Many Relationship

Using this method, you keep one current record of the customer and many history records in the customer history table (a one-to-many relationship between the tables), thus maintaining history in a more normalized data model. The table below shows you how the data might appear.

In the CUSTOMER table the customer operational unique identifier is retained in the CUSTOMER.Id column. In the HIST_CUST table, the operational key is maintained in the HIST_CUST.Id column and the generalized key in the HIST_CUST.G_id column. This enables you to keep all the keys needed and multiple records for the customer.

The CUSTOMER table may contain full details for each customer; however, it could contain only the key values, leaving the full details (including text descriptions) in the HIST_CUST table.

CUSTOMER.Id  HIST_CUST.Id  HIST_CUST.G_id
1234         1234          1234
                           1234A
                           1234B
4567         4567          4567
5678         5678          5678
                           5678A
                           5678B
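The one-to-many arrangement between CUSTOMER and HIST_CUST can be sketched as below. SQLite stands in for the warehouse database. The Id and G_id columns follow the text; the address and effective_date columns, the function add_history, and the letter-suffix scheme for G_id are assumptions, with the Comer addresses borrowed from the earlier example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,      -- operational identifier, one row per customer
    name TEXT
);
CREATE TABLE hist_cust (
    g_id           TEXT PRIMARY KEY,                       -- generalized key
    id             INTEGER NOT NULL REFERENCES customer(id),
    address        TEXT,
    effective_date TEXT
);
""")

def add_history(conn, cust_id, address, effective_date):
    """Append one history row for a customer and return its generalized key."""
    n = conn.execute(
        "SELECT COUNT(*) FROM hist_cust WHERE id = ?", (cust_id,)
    ).fetchone()[0]
    # Assumed scheme: 1234, then 1234A, 1234B, ...
    g_id = str(cust_id) if n == 0 else f"{cust_id}{chr(ord('A') + n - 1)}"
    conn.execute(
        "INSERT INTO hist_cust VALUES (?, ?, ?, ?)",
        (g_id, cust_id, address, effective_date),
    )
    return g_id

conn.execute("INSERT INTO customer VALUES (1234, 'Comer')")
add_history(conn, 1234, "1 Main Street", "1993-04-01")
add_history(conn, 1234, "200 First Ave", "1997-06-01")

# One current customer row joins to many history rows.
history = conn.execute(
    "SELECT c.id, h.g_id, h.address FROM customer c "
    "JOIN hist_cust h ON h.id = c.id ORDER BY h.effective_date"
).fetchall()
print(history)
```

The CUSTOMER row never changes its key, so facts can keep referring to the operational identifier while HIST_CUST carries as many versions as the business needs.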


History Preserved

• History enables realistic analysis.

• History retains context of data.

• History provides for realistic historical analysis.

• Model must be able to:

– Reflect business changes

– Maintain context between fact and dimension data

– Retain sufficient data to relate old to new


History Preserved

This method completely preserves history and is therefore very effective for performing analysis over time where data has changed substantially. The context of information is still preserved. A good example of where this applies is in a sales organization.

Assume that you have a model containing a sales fact and dimensions such as Customer, Sales Region, and Product.

Your warehouse contains sales figures for sales region Europe for the years 1992 and 1993. In 1994, the European region reorganizes and splits into East Europe and West Europe. The warehouse is now maintaining data for each region from 1994 onward.

In 1997, users are asked to put together some projections based on the last five years’ sales in Europe. The data for 1992 and 1993 is not split into East and West Europe. That is not an issue, because you can still roll up the East and West regions into a total for Europe and perform the analysis over a five-year period.

If we reverse the scenario, two regions become one and the solution is the same.
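A minimal sketch of the roll-up described in this example follows. SQLite stands in for the warehouse database; the region table with its parent column is one assumed way of retaining enough data to relate old and new regions, and all sales amounts are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (name TEXT PRIMARY KEY, parent TEXT);  -- parent links new regions to the old one
CREATE TABLE sales  (year INTEGER, region TEXT, amount INTEGER);
""")
conn.executemany("INSERT INTO region VALUES (?, ?)", [
    ("Europe", None),
    ("East Europe", "Europe"),
    ("West Europe", "Europe"),
])
# Invented figures: Europe is reported as a whole until 1993, split from 1994.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1992, "Europe", 100), (1993, "Europe", 110),
    (1994, "East Europe", 50), (1994, "West Europe", 70),
    (1995, "East Europe", 55), (1995, "West Europe", 75),
    (1996, "East Europe", 60), (1996, "West Europe", 80),
])

# Roll every row up to its parent region when it has one.
rollup = conn.execute("""
    SELECT s.year, COALESCE(r.parent, r.name), SUM(s.amount)
    FROM sales s JOIN region r ON r.name = s.region
    GROUP BY s.year, COALESCE(r.parent, r.name)
    ORDER BY s.year
""").fetchall()
print(rollup)
```

The same parent link handles the reverse scenario: if two regions merge, the old regions simply point to the new combined one.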

The issue with retaining history and context is building a model that is able to:

• Reflect changes as the business changes

• Keep the context of information accurate between dimension and fact data

• Retain sufficient data to be able to relate old and new records where needed


Version Numbering

• Avoid double counting

• Facts hold version number

Customer.CustId  Version  Customer Name
1234             1        Comer
1234             2        Comer

Sales.CustId  Version  Sales Facts
1234          1        11,000
1234          2        12,000

[Figure: Sales fact table with Customer, Product, and Time dimensions.]


Version Numbering

You can also maintain a version number for the customer in the Customer dimension:

Custid  Name   Address        Version
1234    Comer  1 Main Street  1
1234    Comer  200 First Ave  2

You must ensure that the measures in the fact table, such as sales figures, also contain the customer version number to avoid any double counting:

Custid  Version  Sales $
1234    1        11,000
1234    2        12,000
1234    1        5,000
1234    1        10,000
1234    2        45,000
1234    2        30,000
1234    1        10,000

For Comer Version 1, the sales total is $36,000.

For Comer Version 2, the sales total is $87,000.
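The totals quoted for Comer, and the double counting that the version number prevents, can be checked with a short sketch. SQLite stands in for the warehouse database and the table and column names are assumptions; the rows are the ones from the tables in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custid INTEGER, version INTEGER, name TEXT, address TEXT,
                       PRIMARY KEY (custid, version));
CREATE TABLE sales (custid INTEGER, version INTEGER, amount INTEGER);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (1234, 1, "Comer", "1 Main Street"),
    (1234, 2, "Comer", "200 First Ave"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1234, 1, 11000), (1234, 2, 12000), (1234, 1, 5000), (1234, 1, 10000),
    (1234, 2, 45000), (1234, 2, 30000), (1234, 1, 10000),
])

# Sales per customer version.
per_version = dict(conn.execute(
    "SELECT version, SUM(amount) FROM sales WHERE custid = 1234 GROUP BY version"
).fetchall())

# Joining on custid alone matches every fact row to both customer versions.
naive_total = conn.execute(
    "SELECT SUM(s.amount) FROM sales s JOIN customer c ON c.custid = s.custid"
).fetchone()[0]

# Joining on custid and version matches each fact row exactly once.
correct_total = conn.execute(
    "SELECT SUM(s.amount) FROM sales s JOIN customer c "
    "ON c.custid = s.custid AND c.version = s.version"
).fetchone()[0]

print(per_version, naive_total, correct_total)
```

The join on Custid alone reports twice the real total, because each sales row pairs with both Comer records; including the version in the join brings the total back to the sum of the two version totals.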


Purging and Archiving Data

• As data ages, its value depreciates.

• Remove old data from the warehouse:

– Archive for later use

– Purge without copy


Purging and Archiving Data

Data may reside in the warehouse for many more years than it would in an operational system; however, it does not remain forever. The value of data to the business diminishes over time.

During analysis, the analysts determine the useful life span of the data. In addition, old data may simply be summarized; the detail is not needed.

What Is Purge?

If there is no chance of ever needing the data again, even for summaries, then you can purge it. This removes the data entirely; no copy is retained.

What Is Archive?

If you feel you may need the data in the future (to build summaries, for example), then archive the data to low-cost storage devices that are not associated with the data warehouse.

Your Role

You need to ensure that strategies are in place that meet the defined business requirements for purge and archive.
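The difference between purging and archiving can be sketched as follows. This is an illustration under stated assumptions: SQLite stands in for the warehouse database, a CSV file in a temporary directory stands in for low-cost archive storage, and the functions purge and archive_then_purge, the sales table, and its rows are all invented for the example.

```python
import csv
import os
import sqlite3
import tempfile

def purge(conn, cutoff_year):
    """Remove old rows without keeping a copy; returns the number of rows removed."""
    cur = conn.execute("DELETE FROM sales WHERE year < ?", (cutoff_year,))
    return cur.rowcount

def archive_then_purge(conn, cutoff_year, archive_path):
    """Copy old rows to a file outside the warehouse, then remove them."""
    rows = conn.execute(
        "SELECT year, region, amount FROM sales WHERE year < ? ORDER BY year",
        (cutoff_year,),
    ).fetchall()
    with open(archive_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["year", "region", "amount"])
        writer.writerows(rows)
    return purge(conn, cutoff_year)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1992, "Europe", 100), (1993, "Europe", 110), (1997, "Europe", 150),
])

archive_path = os.path.join(tempfile.mkdtemp(), "sales_archive.csv")
removed = archive_then_purge(conn, 1994, archive_path)
remaining = conn.execute("SELECT year, region, amount FROM sales").fetchall()
print(removed, remaining)
```

Calling purge on its own corresponds to the case where there is no chance of needing the data again; archive_then_purge keeps a copy that could later be reloaded to rebuild summaries.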


Techniques for Purging Data

• TRUNCATE: Retains no rollback

• DELETE: Retains redo and rollback

• ALTER TABLE: Removes a partition

• PL/SQL: Uses database triggers


Techniques for Purging Data

TRUNCATE Command

The SQL TRUNCATE command is the quickest way to purge data. It generates no rollback information for the removed rows and only minimal redo, so it cannot be rolled back. It is also useful for emptying a temporary table that is used repeatedly as part of a regular load or summary process. Indexes on the table are also truncated.

DELETE Command

The SQL DELETE command is used if the data has not been partitioned. DELETE generates both redo and rollback information, so you need to size the rollback segments carefully. NOLOGGING does not apply to DELETE or UPDATE. DELETE can run in parallel only on partitioned tables. Oracle8 syntax enables you to delete rows from a single partition.

When you delete rows from a table, the corresponding entries in every index on the table must also be deleted. This has a performance impact.

ALTER TABLE Command

Given that your warehouse data is commonly partitioned by time, you can simply remove a partition containing old data (ALTER TABLE ... DROP PARTITION).

PL/SQL Triggers

Where there are special requirements and low volumes of data, you can use PL/SQL and the ON DELETE database trigger. This is, however, an expensive option.


Techniques for Archiving Data

• Export to dump file from tables

• Import to tables from dump file

• ALTER TABLE EXCHANGE partitions

[Figure: EXP writes tables from the database to a .dmp file; IMP reads the .dmp file back into tables.]


Verdict

• Defined by business requirements

• Must be managed


Techniques for Archiving Data

Import and Export Utilities

The export utility enables you to move data from tables to a dump file (called filename.dmp). The import utility can then read that dump file and load the data back into the same user’s schema or into another user’s schema.

You can export in two ways:

• A conventional path export uses a conventional SELECT statement to extract table data which is held for a short time in an evaluation buffer. Once evaluated, it is transferred to the dump file.

• A direct path export does not use the evaluation buffer.

ALTER TABLE

You can also switch a partition of data with an empty table, drop the empty partition, and export the table. Archive the exported table when you have time.

Verdict

The method you employ depends upon your individual business requirement, although the history model is a popular choice in the current warehousing environment. You must ensure that someone in the data warehouse administration is responsible for managing and tracking these changes.


Final Tasks

• Update metadata

– ETT

– User

• Publish data

– Availability

– Changes

– Subject area basis

• Use database roles to prevent and allow access

[Figure: ETT process flow: Sources, Extract, Stage, Transform, Rules, Load, Publish, Query.]


Final Tasks

Update the Metadata

Once your data has been loaded successfully, ensure that the metadata is updated. You need to consider many aspects, including information about the processes themselves. The most important aspect at this time is to ensure that the metadata reflects the new information available. Users must be made aware of the changes, for example, of the validity of data, date of data, any new data available, revised summaries, removed summaries, new algorithms, and the new meaning of values.

Publish Data

So that users are presented with a consistent view of the data, ensure that user access is denied while the ETT processes are executing. You should allow access only when all tasks are complete, validation has occurred, and the metadata has been updated.

You may choose to do this on a subject area basis, user basis, or for the entire warehouse. Again, like many other tasks, this is dependent upon your individual data warehouse or data mart implementation.

Accessing the Refreshed Warehouse

With Oracle, using roles and granting and revoking privileges is the simplest method of preventing and allowing access.

You may advise the users that the warehouse is available by internal e-mail mechanisms. Alternatively, if you have strict service level agreements (SLAs) that state users must have access from, say, 8:00 a.m. every working day, then notification may not be needed. You could notify users only if the warehouse is not available because of some unforeseen problem.


Publishing Data

• Control access using database roles

• 24-hour operation may be requested

• Compromise between load and access

• Consider

– Staggering updates

– Using temporary tables

– Using separate tables


Publishing Data

The term “publishing data” is used to describe when the data is loaded and made available to the users.

As a rule you prevent access to the data while the load process is active, to ensure that the users are presented with an accurate view of data and summaries.

If service level agreements state that users require access virtually 24 hours a day, then revoking and granting access as discussed is not appropriate. You need to consider how you can perform the load action while still allowing access, and ensuring that the data is as consistent as possible.

There are different techniques depending upon the availability needs of the users.

• Stagger the updates to the different subject areas. Update on different nights of the week (say Tuesday and Wednesday) even though the revised source data might be made available days earlier.

• Use temporary tables (that the users cannot access) for loading, filtering, and summarizing. Make the database unavailable only for the short time it takes to instantiate these as permanent objects.

• Load the data into a separate table and perform all the processing required. These actions are invisible to the user. Then when all tasks are complete, swap the contents of the temporary table into a database partition. The same technique is employed for the indexes.

Note: With Oracle7, the partition is a view. In Oracle8, this is a partitioned table.
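The load-into-a-separate-table technique can be sketched as below. Note the substitution: the text describes swapping the loaded table into a database partition, whereas this sketch uses SQLite and a pair of table renames inside one transaction as a stand-in for that swap. The table names sales, sales_load, and sales_old and the function publish are assumptions for illustration.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that the
# transaction around the swap can be controlled explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sales (year INTEGER, amount INTEGER)")
conn.execute("INSERT INTO sales VALUES (1996, 140)")

def publish(conn, new_rows):
    """Build the refreshed table out of sight, then swap it in with two renames."""
    # All load work happens in a table that users do not query.
    conn.execute("DROP TABLE IF EXISTS sales_load")
    conn.execute("CREATE TABLE sales_load (year INTEGER, amount INTEGER)")
    conn.execute("INSERT INTO sales_load SELECT year, amount FROM sales")
    conn.executemany("INSERT INTO sales_load VALUES (?, ?)", new_rows)

    # The swap itself is short and all-or-nothing.
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE sales RENAME TO sales_old")
        conn.execute("ALTER TABLE sales_load RENAME TO sales")
        conn.execute("DROP TABLE sales_old")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

publish(conn, [(1997, 150)])

published = conn.execute("SELECT year, amount FROM sales ORDER BY year").fetchall()
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(published, tables)
```

The design point is that users see either the old table or the fully loaded new one; the time during which the data is unavailable shrinks to the duration of the swap.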


ETT Tool Selection Criteria

• Overlap with existing tools

• Availability of meta model

• Supported data sources

• Ease of modification and maintenance

• Required fine tuning of code

• Ease of change control

• Power of transformation logic

• Level of modularization

• Power of error, exception, resubmission features

• Intuitive documentation

• Performance of code


Selecting ETT Tools

Consultants in the field suggest that the selection criteria for ETT tools include the following considerations:

• The overlap with existing tools used in the warehouse development, such as Oracle Designer or other modeling tools

• The availability of the metamodel to other tools or the use of the metamodel from other tools

• The breadth of data sources supported and target data coverage, such as flat files, character formats, and database types

• The mechanism for and ease of defining and altering rules when there is possibly a mixed set of users managing ETT, such as analysts and end users

• The requirement to maintain generated code manually

Some vendors advise there is no need to modify the generated code; however, you may need to fine-tune it. Do you have the in-house expertise to modify the generated code, for example, C or COBOL?

• The control of changes to transformation rule definitions and the ability to handle development and production versions of transformation rules

• The depth, power, and ease of use for the transformation logic; for example, conditional logic, data value filters, row and set-oriented processing, local variables, and input parameters

• The reuse and modularization of existing transforms and filters

• Error reporting, rejected records, and resubmission capabilities

You need to be able to trap and correct bad data before it is loaded into the warehouse and report corrections to the source system afterward.

• The self-documenting ability

If the tool is text-based, and not intuitive to navigate, you are going to find it difficult to get the entire picture of the processing performed within the warehouse. A graphical design tool is desirable.

• The performance of generated code


ETT Tool Selection Criteria

• Activity scheduling and sophistication

• Metadata generation

• Learning curve

• Flexibility

• Supported operating systems

• Cost


Selecting ETT Tools (continued)

• Activity scheduling

Can the tool schedule actions to happen and retry if the source or target is not available? Can it report what it has done?

• Scheduling sophistication

Can it schedule based on time of day, time since last try, time since last success, and time period regardless of last attempt?

• Metadata generation by the transformation tool

Generated metadata should be intuitive and easily understood by the business user.

• The learning curve of the tool

• The flexibility of the tool

• The operating system under which the tool runs

Is it supported on all the platforms that you will use for the ETT process?

• Cost
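The scheduling questions in this list (time since last try, time since last success) can be made concrete with a small sketch. It does not describe any particular ETT tool; the function is_due and its parameters retry_after and max_age are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta

def is_due(now, last_try, last_success, *, retry_after, max_age):
    """Decide whether a load should be attempted now.

    A load is due when the data is stale (no success yet, or the last success
    is older than max_age) and the previous attempt, if any, was made at least
    retry_after ago.
    """
    stale = last_success is None or now - last_success >= max_age
    if not stale:
        return False
    return last_try is None or now - last_try >= retry_after

now = datetime(1999, 5, 3, 2, 0)
policy = dict(retry_after=timedelta(minutes=30), max_age=timedelta(hours=24))

never_run = is_due(now, None, None, **policy)
fresh = is_due(now, now - timedelta(hours=1), now - timedelta(hours=1), **policy)
retry_too_soon = is_due(now, now - timedelta(minutes=10), now - timedelta(hours=25), **policy)
retry_ok = is_due(now, now - timedelta(minutes=45), now - timedelta(hours=25), **policy)
print(never_run, fresh, retry_too_soon, retry_ok)
```

A tool that can express both intervals lets you retry when a source or target was unavailable without hammering it on every scheduler tick.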


Transportation Tools

• Informatica: OpenBridge

• Oracle: SQL*Loader, Gateways, PL/SQL, Precompilers

• Platinum Technology: InfoPump, Platinum Info Transport


Replication Server Utilities

• Oracle Symmetric and Heterogeneous Replication


Transportation Tools

WTI Partner                 Product
Informatica Corp.           OpenBridge
Oracle                      SQL*Loader: Direct Path, Direct Path in Parallel
                            Transparent and Procedural Gateways
                            PL/SQL
                            Precompilers
Platinum Technology, Inc.   InfoPump
                            Platinum Info Transport

Replication Server Utilities

WTI Partner                 Product
Oracle                      Symmetric and Asymmetric Replication
                            Heterogeneous Replication


Gateways and Middleware

• Brio Technology DataPrism

• Informatica Corporation OpenBridge

• Information Builders EDA/SQL

• Oracle Gateways

• Platinum Technology InfoHub

• Prism Prism Manager

• Software AG Entire Transaction Propagator


Gateways and Middleware

WTI Partner                    Product
Brio Technology                DataPrism
Informatica Corp.              OpenBridge
Information Builders, Inc.     EDA/SQL
Oracle                         Oracle Open Gateways
                               Procedural Gateways
                               SQL*Loader
Platinum Technology, Inc.      InfoHub
Prism                          Prism Manager
Software AG of North America   Entire Transaction Propagator


Summary

This lesson discussed the following topics:

• Capturing changed data

• Applying the changes

• Purging and archiving data

• Publishing the data, controlling access, andautomating processes

• Identifying tools for transporting data into thewarehouse


Practice 13-1 Overview

This practice covers the following topics:

• Identifying a series of statements as true or false

• Answering a series of questions

Practice 13-1

1 Identify whether the following statements are true or false.

Statement (mark each as True or False)

• The data refresh cycle is determined primarily by information technology staff input. True / False

• The load window is the time that the IT group has dictated the data warehouse is available to the users for access. True / False

• Fact data frequently changes. True / False

• Dimension data infrequently changes. True / False

2 Name four different techniques for capturing the changes to operational data that is to be loaded into the warehouse.

_____________________

_____________________

_____________________

_____________________

3 Answer the following questions about updating dimension data.

a What method of updating dimension data would you employ if you wanted to keep old and new records?

b What relationship would that map to in an entity relationship model?

4 What server technique can be used to prevent and allow access to data in the warehouse after refresh?

Lesson 14: Leaving a Metadata Trail

Overview

[Figure: course overview diagram. Blocks: Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Choosing a Computing Architecture; Modeling the Data Warehouse; Planning Warehouse Storage; ETT (Building the Warehouse); Meeting a Business Need; Supporting End User Access; Managing the Data Warehouse; Project Management (Methodology, Maintaining Metadata)]

Objectives

After completing this lesson, you should be able to do the following:

• Define warehouse metadata, its types, and its roles in a warehouse environment

• Develop a metadata strategy

• Describe in detail each type of warehouse metadata

• List tools for managing metadata

• Describe the Oracle Common Warehouse Metadata architecture (CWM)

Overview

Metadata has already been referenced a number of times in this course. It is critical to every phase of warehouse design and development. This lesson examines the role of warehouse metadata in greater detail.

Note that the “Project Management (Methodology, Maintaining Metadata)” block is highlighted in the overview slide on the facing page.

Objectives

After completing this lesson, you should be able to do the following:

• Define metadata, its types, and the main roles of metadata in a warehouse environment

• Describe the challenges of managing warehouse metadata

• List tools for managing metadata

• Describe the Oracle Common Warehouse Metadata architecture (CWM)


Defining Warehouse Metadata

• Data about warehouse data and processing

• Vital to the warehouse

• Used by everyone


Defining Warehouse Metadata

Data About Data

Metadata is “data about data.” Warehouse metadata is descriptive data about warehouse data and the processes used in creating the warehouse.

Warehouse metadata contains detailed descriptions of the location, structure, and meaning of data. It describes keys and indexes of the data. It contains mapping information, and it documents the algorithms and business rules used to transform and summarize data. Metadata is used throughout the warehouse, from the extraction stage through the access stage.

Vital to the Warehouse

A warehouse with poor metadata is analogous to a filing cabinet filled with folders stored in no particular order. It is very difficult to find your information in the cabinet.

Used by Everyone

Warehouse metadata is used directly or indirectly by everyone involved in creating, maintaining, or using the warehouse: database administrators, analysts, designers, and users. Warehouse metadata answers the following types of questions:

• What information is available, by subject area, and when did we start collecting that data?

• How was this summarization created?

• What queries are available to access the data?

• What business assumptions have been made?

• How do I find the data I need?

• How old is the data?

• What does that value mean?
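The questions above can be illustrated with a minimal Python sketch of a metadata catalog. It is not taken from the course or from any Oracle product: the `ColumnMetadata` layout, the `CATALOG` entries, and the helper names are assumptions invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ColumnMetadata:
    """Descriptive metadata for one warehouse column (hypothetical layout)."""
    table: str
    column: str
    subject_area: str
    meaning: str
    collected_since: date
    last_refreshed: date
    derivation: str = ""  # how a summary or derived value was produced


# Hypothetical catalog entries, invented for this sketch.
CATALOG = [
    ColumnMetadata("SALES_SUMMARY", "MONTHLY_REVENUE", "Sales",
                   "Total revenue per product per calendar month",
                   date(1996, 1, 1), date(1999, 4, 30),
                   "SUM(SALES.AMOUNT) grouped by product and month"),
    ColumnMetadata("PRODUCT", "WEIGHT", "Product",
                   "Packed shipping weight in kilograms",
                   date(1995, 6, 1), date(1999, 4, 30)),
]


def find(table, column):
    """What does that value mean? How was this summarization created?"""
    for entry in CATALOG:
        if (entry.table, entry.column) == (table, column):
            return entry
    raise KeyError(f"no metadata for {table}.{column}")


def available_by_subject(subject_area):
    """What information is available, by subject area?"""
    return [f"{e.table}.{e.column}" for e in CATALOG
            if e.subject_area == subject_area]


def age_in_days(table, column, today):
    """How old is the data?"""
    return (today - find(table, column).last_refreshed).days
```

For example, `find("PRODUCT", "WEIGHT").meaning` returns the business description of the column, and `age_in_days` reports how many days have passed since the last refresh.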

The Key to Understanding Warehouse Information

• Specifies data location

• Manages data

• Aids use of information

• Describes the data

• Documents the development process

• Provides a record of changes

• Records enhancements over time

Key to Understanding Warehouse Information

Metadata is the component that holds all the information about the data in the warehouse, and presents it as information to the user.

Data becomes information if, and only if, you:

• Have the data

• Know you have it

• Know where it is

• Can access the data

• Can trust the data

Metadata is the key to understanding the warehouse. Metadata helps you locate, manage, and use warehouse information by:

• Specifying the location of data

• Managing data

• Aiding the use of information

• Describing the data

• Documenting the development process

• Providing a record of changes

• Recording enhancements over time


Metadata Users

[Figure: metadata users. Labels: IT developers; End users; Warehouse; Metadata repository; Operational, ETT, and End user metadata]

Types of Metadata

• End user:

– Key to a good warehouse

– Navigation aid

– Information provider

• ETT:

– Maps structure

– Source and target information

– Transformations

– Context

• Operational:

– Load, management, scheduling processes

– Performance

Metadata Users

In the warehouse, metadata is employed directly or indirectly by all warehouse users for many different tasks.

End Users: The decision support analyst (or user) uses metadata directly. The user does not have the high degree of knowledge that the IT professional has, and metadata is the map to the warehouse information. One measure of a successful warehouse is the strength and ease of use of end-user metadata.

Developers: For the developer, metadata contains information on the location, structure, and meaning of data, information on mappings, and a guide to the algorithms used for summarization between detail and summary data.

Types of Metadata

End-User Metadata: End-user metadata describes the location and structure of data for user access. It describes data volumes and algorithms. Essentially, this is the floor plan that the knowledge worker uses to navigate through and around the data.

ETT Metadata: Extraction, transformation, and transportation metadata (sometimes called warehouse metadata or ETT metadata) maps the structure of source systems and describes how the data is to be transformed into its new format for the warehouse. It contains all the rules for extracting, scrubbing, summarizing, and transporting data. This is often the most difficult metadata model to construct.

Operational Metadata: Operational metadata is used by the load, management, and access processes for scheduling data loads or end-user access. It contains information about housekeeping activities, statistics of table usage, and information about every aspect of performance.

Note: The Oracle Method has a specific process for metadata management. End-user metadata is referred to as business metadata.


Developing a Metadata Strategy

• Define a strategy to ensure high-quality metadata useful to users and developers.

• Primary strategy considerations:

– Define goals and intended use.

– Identify target users.

– Choose tools and techniques.

– Choose the metadata location.

– Manage the metadata.

– Manage access to the metadata.

– Integrate metadata from multiple tools.

– Manage change.

Defining Metadata Goals and Intended Usage

• Define clear goals.

• Identify requirements.

• Identify intended usage.

Developing a Metadata Strategy

Like every other aspect of the data warehouse implementation, metadata should be the subject of a well-considered, well-planned strategy. You must ensure that the metadata is of a high quality, provides the right information to users and developers, and is able to take into account the various tools that employ metalayers. Integrating these layers is critical.

Primary Considerations

Among many other considerations, you need to resolve these key issues for the strategy:

• Define the goals and intended use of the warehouse metadata.

• Identify the target users of warehouse metadata.

• Choose tools and techniques for creating and managing metadata.

• Choose the metadata location.

• Manage the metadata.

• Manage access to the metadata.

• Integrate multiple sets of metadata from different tools.

• Manage changes to metadata.

Defining Metadata Goals and Intended Usage

Identify the intention of the metadata you develop. Outline main requirements such as maintaining history, context, and algorithms.

Identifying Target Metadata Users

• Who are the metadata users?

– Developers

– End users

• What information do they need?

• How will they access the metadata?

Choosing Metadata Tools and Techniques

• Tools

– Data modeling

– ETT

– End-user query and analysis

• Database schema definitions

• COBOL copybooks

• Middleware tools

Identifying Target Metadata Users

Consider who, among both developers and end users, is to access metadata. What information do they need? Determine how they will access the metadata.

Choosing Tools and Techniques

Data Modeling Tools: These tools are also known as computer-aided software engineering (CASE) tools. Oracle’s data modeling tool is Designer.

Some of these tools are better than others at physically modeling metadata. Consider using a tool that either is specifically designed to model warehouse features or is extensible. For example, can the tool model a star or a snowflake schema?

ETT Tools: Tools for extracting, transforming, and transporting data into a warehouse also generate metadata. These tools are expensive purchases, and may not be employed for the first iteration during development. However, these tools have the advantage of being able to create and maintain a metadata layer. The tool itself must have all the information to take source data to the warehouse, so it is logical that the tool itself contains this layer.

End User Tools: Some tools for query and analysis allow the administrator to create a metadata layer, which describes the structure and content of the data warehouse.

An administrator must consider the maintenance burden of tool metadata: each query tool needs its own metadata layer.

Database Schema Definitions: Database schema definitions in a relational database management system offer another potential source of metadata. In an Oracle environment this is the Data Dictionary, which can be extended and enhanced.

Most dictionaries of database contents, including the Oracle Data Dictionary, are limited in their immediate value as a metadata tool. Check how far these dictionaries can be extended and enhanced.

Other Techniques: Less-common sources of metadata include:

• File definitions stored in COBOL copybooks

• Middleware tools


Choosing the Metadata Location

• Usually the warehouse server

• Possibly on operational platforms

• Desktop tool with metalayer

[Figure: Warehouse and Metadata repository, fed by External sources and Operational data sources]

Managing the Metadata

• Managed by the metadata manager

• Maintained by the metadata architect

• Standards produced by the metadata architect

[Figure: Warehouse and Metadata repository, fed by External sources and Operational data sources]

Choosing the Metadata Location

For every process and product employed in the data warehouse environment, metadata exists. Where it is stored is product-specific. The decision about where to place the metadata is often determined by the tool you use to create it.

If you are using a relational database management system, then by default the metadata resides in the database and usually on the warehouse server. This is the preferred method. Alternatively, you may locate the metadata in a separate database on another machine.

Some ETT and query tools have their own metalayer. Where this is the case you need to ensure that each metalayer can communicate with the others.

Managing the Metadata

Management: Given the critical importance of metadata within the warehouse environment, it must be subject to strict control and management. Metadata is such a vital component in your warehouse implementation that someone should be responsible for managing and maintaining it.

It is also important to ensure that creation of or changes to metadata are agreed upon with formal sign-off.

Maintenance: A metadata architect is usually responsible for defining the strategy and implementing metadata. This person is primarily responsible for ensuring that metadata remains up-to-date and consistently reflects any changes within the business infrastructure.

If there are different metalayers, the architect must control integration of the metadata among products and tools.

Standards: As with any development project, standards are critical. Determine standards for every aspect of metadata, from simple naming conventions, to versioning requirements, to documenting complex algorithms.

Standards for metadata are emerging within the industry. It is worth monitoring the changes that vendors are considering, as well as the collaborative exercises between large computing companies that are looking to define standards.
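As an illustration of enforcing such standards, the following Python sketch checks a metadata entry against a naming convention, a versioning rule, and a documentation rule. The convention itself (upper-case names separated by underscores, versions starting at 1) and the names `MetadataEntry` and `violations` are assumptions made for this example, not rules stated in the course.

```python
import re
from dataclasses import dataclass

# Assumed convention for this sketch: upper-case words separated by underscores.
NAME_PATTERN = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")


@dataclass
class MetadataEntry:
    name: str
    version: int
    description: str


def violations(entry):
    """Return a list of standards violations for one metadata entry."""
    problems = []
    if not NAME_PATTERN.match(entry.name):
        problems.append("name does not follow the naming convention")
    if entry.version < 1:
        problems.append("version must start at 1")
    if not entry.description.strip():
        problems.append("description is missing")
    return problems
```

An entry such as `MetadataEntry("SALES_FACT", 1, "Daily sales")` passes all three checks, so a check of this kind could run as part of the formal sign-off step.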


Integrating Multiple Sets of Metadata

• Multiple tools may generate their own metadata.

• There are many metalayer integration issues.

• Metadata exchangeability is desirable.


Managing Changes to Metadata

• Different types of metadata have different rates of change.

• Consider metadata changes resulting from refresh cycles.

Integrating Multiple Sets of Metadata

Each of the tools you use in your warehouse environment might generate its own set of metadata. One of the biggest problems with metadata is integrating all of the different layers.

Some vendors provide tools that can exchange metadata. For example, you can take metadata from Oracle Designer, populate it using Prism Directory Manager, and use it directly in Oracle Discoverer.

Later in this lesson, we examine how Oracle Common Warehouse Metadata (CWM) addresses the sharing of metadata among Oracle tools.

Managing Changes to Metadata

Metadata changes at different rates according to the type of data stored. For example, models of operational and warehouse databases might remain static for a substantial period of time; however, metadata that maintains information about the warehouse data changes frequently.

Each refresh cycle brings in more data, and with it summaries and dimensions may change, along with the metadata that describes them.
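One way to keep track of metadata that changes with each refresh is to store a dated history of values. The Python sketch below is only an illustration of that idea; `VersionedMetadata` and its methods are hypothetical names, and the row counts are invented.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class VersionedMetadata:
    """A metadata item with a dated history of values (hypothetical design)."""
    name: str
    history: list = field(default_factory=list)  # (effective_date, value), oldest first

    def record(self, effective, value):
        # Store a new version only when the value actually changed.
        if not self.history or self.history[-1][1] != value:
            self.history.append((effective, value))

    def current(self):
        return self.history[-1][1]

    def as_of(self, when):
        """Return the value in effect on a given date, or None if none existed yet."""
        value = None
        for effective, recorded in self.history:
            if effective <= when:
                value = recorded
        return value


# Usage: record the row count of a fact table after each monthly refresh.
row_count = VersionedMetadata("SALES_FACT.row_count")
row_count.record(date(1999, 1, 31), 1000)
row_count.record(date(1999, 2, 28), 1000)   # unchanged, so no new version
row_count.record(date(1999, 3, 31), 1250)
```

Static metadata such as a data model accumulates few versions under this scheme, while volatile metadata gains a new version whenever a refresh changes it.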


Examining Types of Metadata

• ETT metadata

• End user metadata

[Figure: Warehouse and Metadata repository holding ETT and End user metadata]

ETT Metadata

• Business rules

• Source tables, fields, and key values

• Ownership

• Field conversions

• Encoding and reference table

• Name changes

• Key value changes

• Default values

• Logic to handle multiple sources

• Algorithms

• Time stamp

[Figure: Extraction from External sources and Operational data sources into a Staging file]

Examining Types of Metadata

Now we will examine more closely the different types of warehouse metadata. This includes ETT metadata generated during warehouse development, as well as end-user metadata.

ETT Metadata

ETT metadata defines how data from the physical level in the source system maps to the physical level in the data warehouse. ETT metadata also holds:

• The business rules that are applied to the warehouse data

• Names of the source tables, source fields, and source key values

• Information about the owner of the source data

• The rules that are applied to field conversions on a field-by-field basis

• Encoding and reference table conversions

• Field name and key value changes

• Default values assigned to NULL fields

• Logic to extract records from multiple source systems and create records (or a single record) for the load process

• Algorithms that create derived data:

Total_Sales / Unit_Sold = Selling_Price

• Time-stamp details

You have seen how complex the ETT process is, and you can now appreciate the importance of keeping a record of exactly what is happening, to which data and when, what the grain is, what is derived, how data is summarized, where it is sourced, and what its target is.
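To make a few of these items concrete, here is a minimal Python sketch of source-to-target mapping metadata with field conversions, an encoding conversion, default values for NULL fields, and one derived value. All field names, the color codes, and the `FieldMapping` structure are assumptions invented for illustration; they do not describe any particular ETT tool.

```python
from dataclasses import dataclass
from typing import Any, Callable


def identity(value):
    return value


@dataclass
class FieldMapping:
    """One source-to-target field rule (hypothetical layout)."""
    source_field: str
    target_field: str
    convert: Callable[[Any], Any] = identity  # field conversion rule
    default: Any = None                       # value assigned when the source is NULL


# Hypothetical encoding conversion: source color names to warehouse codes.
COLOR_CODES = {"RED": 15, "BLUE": 16}

# Hypothetical mappings; the source field names are invented.
MAPPINGS = [
    FieldMapping("PROD_NO", "PRODID", convert=int),
    FieldMapping("QTY", "UNITS_SOLD", convert=int, default=0),
    FieldMapping("AMT", "TOTAL_SALES", convert=float, default=0.0),
    FieldMapping("CLR", "COLOR_CODE", convert=lambda c: COLOR_CODES[c], default=0),
]


def transform(source_row):
    """Apply every mapping, then compute the derived selling price."""
    target = {}
    for m in MAPPINGS:
        raw = source_row.get(m.source_field)
        target[m.target_field] = m.default if raw is None else m.convert(raw)
    # Derived data: selling price is total sales divided by units sold.
    units = target["UNITS_SOLD"]
    target["SELLING_PRICE"] = target["TOTAL_SALES"] / units if units else None
    return target
```

Because the rules live in `MAPPINGS` rather than in the transformation logic, the same structure doubles as a record of what was done to each field.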


Extraction Metadata

• Space and storage requirements

• Source location information

• Diverse source data

• Access information

• Security

• Contacts

• Program names

• Frequency details

• Failure procedures

• Validity checking information

[Figure: Extraction from External sources and Operational data sources]

Extraction Metadata

Extraction metadata contains:

• Space requirement information

• Storage frequency and duration details

• Source location information such as hardware platform information, gateway information, operating system, file system, database, origin and destination information, and loading rules

• Diverse system information with details of the source type such as whether the data is production, internal, external, or archive; structure information such as file type, name, field type, and data granularity

• Access information such as alias names, versions, relationships, data volatility

• Security information, table owners, data owners, authorization levels, audit trail information

• Source data contact and owner details; for example, their names, telephone numbers, e-mail identifiers

• Extraction program names

• Temporary storage details, name of storage file, procedure for removing storage files

• Extraction frequency details

• Extraction failure procedures, with contingency plans and mechanisms for handling a failed extract

• Extraction validity check information including the procedures to implement, expected results, procedures to follow should the validity check fail, names of the people to contact if the check fails
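A small Python sketch shows how a few of these items might sit together in one record and drive a validity check. The `ExtractionMetadata` fields, the minimum-row rule, and the sample values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ExtractionMetadata:
    """A subset of extraction metadata for one extract job (hypothetical layout)."""
    program_name: str
    source_system: str
    frequency: str
    staging_file: str
    expected_min_rows: int   # assumed validity rule: at least this many rows
    failure_contact: str     # person to contact if the validity check fails


def check_extract(meta, rows_extracted):
    """Return (ok, message) for a completed extract run."""
    if rows_extracted >= meta.expected_min_rows:
        return True, (f"{meta.program_name}: {rows_extracted} rows extracted "
                      f"to {meta.staging_file}")
    return False, (f"{meta.program_name}: only {rows_extracted} rows, expected at "
                   f"least {meta.expected_min_rows}; contact {meta.failure_contact}")


# Usage with invented values.
orders_extract = ExtractionMetadata(
    program_name="extract_orders",
    source_system="ORDERS_PROD",
    frequency="daily",
    staging_file="/staging/orders.dat",
    expected_min_rows=1000,
    failure_contact="source.owner@example.com",
)
```

Calling `check_extract(orders_extract, 10)` returns a failure message that names the contact recorded in the metadata, which is the kind of failure procedure the list above describes.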


Transformation Metadata

• Duplication routines

• Exception handling

• Key restructuring

• Grain conversions

• Program names

• Frequency

• Summarization

[Figure: Transform step between External sources, Operational data sources, and the Staging file]

Transportation Metadata

• Method of transfer

• Frequency

• Validation procedures

• Failure procedures

• Deployment rules

• Contact information

[Figure: ETT flow from External sources and Operational data sources through Transform to the Staging file, then Transport to the Warehouse and Metadata repository]

Transformation Metadata

Transformation metadata contains:

• Duplication routines for elimination, consolidation, ordering, and summarization of data

• Exception handling and validation procedures

• Key restructuring rules

• Granularity conversions

• Transformation program names and locations

• Frequency of the transformation

• Summarization procedures
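Two of the routines that transformation metadata documents, duplicate elimination and granularity conversion, can be sketched in Python as follows. The row layout (`product_id`, `sale_date`, `amount`) and the function names are hypothetical, and dates are assumed to be ISO strings.

```python
from collections import defaultdict


def eliminate_duplicates(rows, key_fields):
    """Keep the first row seen for each combination of key fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def summarize_to_month(rows):
    """Convert daily grain to monthly grain by summing amounts per product and month."""
    totals = defaultdict(float)
    for row in rows:
        month = row["sale_date"][:7]  # assumes ISO dates such as "1999-05-03"
        totals[(row["product_id"], month)] += row["amount"]
    return dict(totals)


# Usage with invented daily sales rows, one of which is a duplicate.
daily = [
    {"product_id": 1, "sale_date": "1999-05-03", "amount": 10.0},
    {"product_id": 1, "sale_date": "1999-05-03", "amount": 10.0},
    {"product_id": 1, "sale_date": "1999-05-04", "amount": 5.0},
    {"product_id": 1, "sale_date": "1999-06-01", "amount": 2.0},
]
unique = eliminate_duplicates(daily, ["product_id", "sale_date"])
monthly = summarize_to_month(unique)
```

Recording which key fields define a duplicate and which grain the summary targets is exactly the information users later need to trust the summarized figures.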

Transportation Metadata

Transportation metadata contains:

• Data-transfer methods

• Frequency of transportation

• Validation procedures

• Failure procedures

• Rules for deployment

• Contact information, in case of any issue with the data or the movement of data


End-User Metadata

739516 1816 666 15 17.62

• Need to know the context of the table queried

• Associate the metadata description

• Analogous to Oracle Data Dictionary views

[Figure: Warehouse and Metadata repository holding End user metadata]

Example of End User Metadata

Table Name   Column Name   Data     Meaning
Product      Prodid        739516   Unique identifier for the product
Product      Valid_date    01/97    Last refresh date
Product      Ware_loc      1816     Warehouse location number
Product      Ware_bin      666      Warehouse bin number
Product      Code          15       The color of the product; please refer to table COL_REF for details
Product      Weight        17.62    Packed shipping weight in kilograms

End-User Metadata

If the following data is warehouse data, how much can you deduce?

739516 0197 1816 666 15 17.62

You can deduce nothing tangible from this data; it is only a series of numbers. It could represent product codes, map coordinates, or employee salaries. The only way to deduce information from this data is to know the context of the table you are querying.

For example, if you are querying the PRODUCT table and the PRODUCT CODE column, metadata may show the information as follows:

When you associate the data with its metadata description, the data becomes information.

Table Name Column Name Data MeaningProduct Prodid 739516 Unique identifier for the productProduct Valid_date 01/97 Last refresh dateProduct Ware_loc 1816 Warehouse location numberProduct Ware_bin 666 Warehouse bin numberProduct Color_code 15 The color of the product; please refer to

table COL_REF for detailsProduct Weight 17.62 Packed shipping weight in kilograms
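
The idea of pairing raw values with their metadata descriptions can be sketched in a few lines of Python. This is only an illustration: the names PRODUCT_METADATA and describe_row are hypothetical, and the column descriptions are simply copied from the example table.

```python
# Hypothetical column metadata for the PRODUCT table (column name, meaning),
# copied from the example table in this lesson.
PRODUCT_METADATA = [
    ("Prodid", "Unique identifier for the product"),
    ("Valid_date", "Last refresh date"),
    ("Ware_loc", "Warehouse location number"),
    ("Ware_bin", "Warehouse bin number"),
    ("Color_code", "The color of the product; please refer to table COL_REF for details"),
    ("Weight", "Packed shipping weight in kilograms"),
]


def describe_row(raw_values, metadata):
    """Pair each raw value with its column name and business meaning."""
    if len(raw_values) != len(metadata):
        raise ValueError("row and metadata must have the same number of columns")
    return [
        {"column": column, "value": value, "meaning": meaning}
        for value, (column, meaning) in zip(raw_values, metadata)
    ]


# The bare series of numbers from the start of this topic.
raw_row = ["739516", "0197", "1816", "666", "15", "17.62"]
described = describe_row(raw_row, PRODUCT_METADATA)

for item in described:
    print(f"{item['column']:<12}{item['value']:<8}{item['meaning']}")
```

Without PRODUCT_METADATA the list raw_row is just six strings; with it, each value can be shown to the user alongside its meaning.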


More End-User Metadata Information

• Location of fact and dimensions

• Availability

• Description of contents

• Algorithms for derived and summary data

• Owners of data and telephone number

[Figure: warehouse, metadata repository, and end user]


More End-User Metadata Information

The user never accesses end-user metadata directly. This metadata is viewed from the end user’s tool and is used to navigate around the data.

Using this metadata, users can see the data available in the warehouse environment and establish the meaning of elements within the warehouse.

User metadata describes:

• The physical location of fact and dimension data.

• The availability of the data. Not all data components of the warehouse are available to every user. Some facts may be sensitive and restricted to specific user groups.

• The exact description of the contents and business algorithms used to create summary data. Users should never be in a position where they are guessing how a summary has been calculated.

• How derived data has been created, the source data, and any algorithms used.

• Data ownership details, so that if there are any problems with the data content, the user can ask the appropriate person about the data and identify or rectify the problems found. This information must include a telephone number, fax number, or e-mail address.

Data ownership details are possibly the most important aspect of end-user metadata. If there is an issue with the data, it must be resolved quickly and appropriately.
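
As a rough sketch, the items listed above could be held in one record per warehouse element. The class name, field names, and sample values below are all hypothetical and are not taken from any Oracle tool.

```python
from dataclasses import dataclass, field


@dataclass
class EndUserMetadata:
    """Hypothetical end-user metadata record for one warehouse element."""
    element: str          # name of the fact, dimension, or summary
    location: str         # physical location of the data
    description: str      # exact description of the contents
    algorithm: str        # how derived or summary values are calculated
    owner: str            # person responsible for the data
    owner_contact: str    # telephone number, fax number, or e-mail address
    allowed_groups: set = field(default_factory=set)

    def is_available_to(self, group):
        """Return True if the given user group may see this element."""
        return group in self.allowed_groups


# Sample values are invented for illustration only.
monthly_sales = EndUserMetadata(
    element="MONTHLY_SALES_SUMMARY",
    location="warehouse server, SALES schema",
    description="Total sales amount per product per calendar month",
    algorithm="SUM(sale_amount) grouped by product and month",
    owner="Sales data steward",
    owner_contact="steward@example.com",
    allowed_groups={"finance", "sales_management"},
)

print(monthly_sales.is_available_to("finance"))    # True
print(monthly_sales.is_available_to("marketing"))  # False
```

Keeping the algorithm and the owner contact next to the description means a user never has to guess how a summary was calculated or whom to ask about it.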


Historic Context of Data

• Supports change history

• Maintains the context of information

[Figure: operational system, warehouse, and metadata repository; structure and content recorded for the years 94 through 98]

Types of Context

• Simple:

– Data structures

– Naming conventions

– Metrics

• Complex:

– Product definitions

– Markets

– Pricing

• External:

– Economic

– Political

[Figure: warehouse holding data for the years 94 through 98]


Historic Context of Data

Historic data often has business rules and algorithms applied that are different from those applied to current data.

In the operational environment, there is only one definition of the database structure at any time. In the warehouse environment, data definitions change over a period of time. It is important to record when data definitions change, including names, key values, default values, and algorithms, so that knowledge workers can analyze the data in the correct context. This ensures that you can identify and understand the differences in the context of the data in historical files.

For example, you may store data for 1994–1996 offline. Suppose you want to store 1997 data online. The default value for an amount field changed from a series of 9s to 0s in 1995. You can run a query to identify amounts between 1994 and 1997, but if you do not understand when and how default amounts were recorded, you may not be able to explain or understand why both 9s and 0s are stored, or realize the impact that the change has on calculations or reports.
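
The default-value example can be sketched as follows. The sketch assumes a hypothetical change history in which the "series of 9s" is the value 999999, and the sample rows are invented; only the 1995 change from 9s to 0s comes from the text.

```python
# Hypothetical change history for the default of an amount field.
# Each entry is (first year the rule applies, default value), sorted by year.
DEFAULT_AMOUNT_HISTORY = [
    (1994, 999999),  # assumed representation of the "series of 9s"
    (1995, 0),       # the default changed to 0 in 1995
]


def default_amount_for(year):
    """Return the default amount that was in force for the given year."""
    applicable = [default for start, default in DEFAULT_AMOUNT_HISTORY if start <= year]
    if not applicable:
        raise ValueError(f"no default recorded for {year}")
    return applicable[-1]


def real_amounts(rows):
    """Keep only rows whose amount is not the default for that row's year."""
    return [(year, amount) for year, amount in rows if amount != default_amount_for(year)]


# Invented sample rows: (year, amount).
rows = [(1994, 120), (1994, 999999), (1995, 0), (1996, 80), (1997, 100)]

kept = real_amounts(rows)
naive_average = sum(amount for _, amount in rows) / len(rows)
context_average = sum(amount for _, amount in kept) / len(kept)

print(naive_average)    # distorted by the 1994 placeholder value
print(context_average)  # average of the amounts that were actually recorded
```

One limitation of this sketch is that a genuine amount of 0 recorded from 1995 onward cannot be told apart from the default, which is exactly the kind of ambiguity the metadata should document.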

Another example arises with products such as personal computers that had very few components when they were first available. Consider the changes they have gone through and the many components they contain today. There is a rapid and voluminous history of change.

Types of Context

The context of data in the warehouse may be:

• Simple contextual information such as data structures, data coding, naming conventions, and data metrics

• Complex contextual information such as product definitions, market territories, pricing, packaging, and rule changes

• External contextual information such as economic forecasts, political information, and competitive information


Additional Metadata Content and Considerations

• Summarization algorithms

• Relationships

• Stewardship

• Permissions

• Pattern analysis

• Reference tables


Additional Metadata Content and Considerations

Some of these points may have already been mentioned.

Summarization Algorithms

You have seen that the warehouse contains fully detailed fact records and summary records that are created according to predefined algorithms. The meaning of the summaries is maintained in the metadata.

Relationships

Relationships show how tables are related, their constraints and rules, and the cardinality of data. This relationship information is maintained in the metadata and is documented along with ownership information and text descriptions of tables and keys.

Stewardship

Metadata must identify the originator of data. Bear in mind that the data in the warehouse has come from many different source systems, with different suppliers, different owners, and different transformation issues.

Permissions

Metadata should maintain, for each record, information about who can access the record and who is authorized to grant permissions on it.

Access Pattern Analysis

Metadata should record which data is accessed frequently, so that performance can be tuned and optimized accordingly. This analysis may in turn identify data that is accessed infrequently or not at all. You should remove data and summaries that are not accessed.

Reference Tables

The contents of these tables must be monitored and maintained with information that relates to their effective date.
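
Access pattern analysis, as described above, amounts to counting how often each data element is queried. The following is a minimal sketch; the summary names and the access log are invented for illustration.

```python
from collections import Counter

# Hypothetical catalog of summaries held in the warehouse.
catalog = ["SALES_BY_MONTH", "SALES_BY_REGION", "SALES_BY_BIN", "RETURNS_BY_QUARTER"]

# Hypothetical log: one entry per query, naming the summary that was read.
access_log = [
    "SALES_BY_MONTH",
    "SALES_BY_REGION",
    "SALES_BY_MONTH",
    "SALES_BY_MONTH",
    "RETURNS_BY_QUARTER",
]

access_counts = Counter(access_log)

# Candidate for tuning: the most frequently accessed summary.
most_accessed = access_counts.most_common(1)[0][0]

# Candidates for removal: summaries that were never accessed.
never_accessed = [name for name in catalog if access_counts[name] == 0]

print(most_accessed)   # SALES_BY_MONTH
print(never_accessed)  # ['SALES_BY_BIN']
```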


Metadata Management Tools

• Carleton

• Evolutionary Technologies

• Hewlett Packard

• Informatica

• Information Advantage

• Oracle Designer

• Platinum Technology

• Prism Solutions

• Sagent


Metadata Management Tools

There are two categories of metadata management tools:

• Generic repository tools, for managing enterprisewide metadata, such as:

– Data Shopper from Platinum Technology

– Data Dictionary from Brownstone/Platinum

– Manager Link from Manager Software Products

• Tools specifically for data warehouses and data marts, such as:

– Prism Directory Manager from Prism Solutions

– Meta Agent from Information Advantage

– Passport from Carleton Corporation

– SmartData Warehouse from Intersolv

WTI Partner                Product
Carleton Corp.             Carleton Passport-Metadata
Evolutionary Technologies  ETI Repository (ObjectStore)
Hewlett Packard            IW Guide
Informatica                Informatica Repository
Information Advantage      Meta Agent
Oracle                     Designer; Data Mart Suite; OADW/Warehouse Builder
Platinum Technology, Inc.  Data Dictionary/Solution Repository (DD/S), Data Shopper, DB Excel, Platinum Repository
Prism Solutions            Prism Directory Manager
Sagent


Common Warehouse Metadata

[Figure: common metadata spanning design and administration; data integration of ERP, external, and operational data into the warehouse and marts; and information delivery through query, report, analyze, mine, and analytic applications]

Common Warehouse Metadata Future

[Figure: Warehouse Builder, Discoverer, Oracle8i Server, and Express Server sharing common metadata]


Common Warehouse Metadata

Common Warehouse Metadata (CWM) is Oracle’s open standard for data warehousing metadata. CWM incorporates both technical and business metadata and covers all aspects of warehousing. CWM will enable tighter integration of metadata among Oracle’s products as well as across industry-leading tools from Oracle partners, resulting in reduced implementation complexity and greater user productivity.

To enable truly open data warehouse functionality, Oracle submitted a Request for Proposal for a Common Warehouse Metadata Interchange standard to the Object Management Group (OMG). The Common Warehouse Metadata Interchange (CWMI) standard will enable the interchange of warehouse metadata among data management and analysis tools, and among warehouse metadata repositories.

One Meaning

Oracle acquired One Meaning, a company specializing in metadata. One Meaning’s metadata technology provides the means for metadata interoperability and transfer, reduces the cost of managing information resources, and enhances the value of stored proprietary information. Oracle’s metadata strategy will provide essential integration and continuity, and add ongoing value to data warehousing implementations.


Summary

This lesson discussed the following topics:

• Definitions

• Integration

• Contents

• Storage

• Creation

• Selection

• Tools


Summary

This lesson discussed the following topics:

• The definitions of the two main types of metadata

• The problems associated with metadata in the warehouse

• Metadata contents

• How metadata might be created

• Where metadata may be stored in a warehouse environment

• Selection criteria for metadata management tools

• A list of metadata management tools available from WTI partners and Oracle


Practice 14-1 Overview

This practice covers the following topic: answering a series of short questions.


Practice 14-1

1 Why is metadata important to the following people?

a Users who are accessing the data warehouse

________________________________________________________

________________________________________________________

b IT staff developing ETT routines

________________________________________________________

________________________________________________________

2 Name two techniques you might employ to create metadata.

________________________________________________________

________________________________________________________

3 Name two roles within the data warehouse development team who have responsibility for metadata.

________________________________________________________

________________________________________________________

4 What is the issue with integration and metadata?

________________________________________________________

________________________________________________________

________________________________________________________

5 What is important about the context of data?

________________________________________________________

________________________________________________________

6 Name the Oracle tools you may use to develop metadata.

________________________________________________________


15

Supporting End-User Access

Lesson 15: Supporting End-User Access

Overview

[Figure: course road map with the “Supporting End User Access” block highlighted. Blocks: Project Management (Methodology, Maintaining Metadata); Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Choosing a Computing Architecture; Modeling the Data Warehouse; Planning Warehouse Storage; ETT (Building the Warehouse); Meeting a Business Need; Managing the Data Warehouse; Supporting End User Access]

Objectives

After completing this lesson, you should be able to do the following:

• Describe the importance of business intelligence

• Identify multidimensional query techniques

• Identify where data mining might be employed in a warehouse environment

• Identify data mining tools


Overview

The previous lesson covered leaving a metadata trail. This lesson discusses supporting end-user access. Note that the “Supporting End User Access” block is highlighted in the course road map on the facing page.

Specifically, this lesson introduces the concept of business intelligence. The lesson discusses the discovery model used by mining tools, and the reasons enterprises are looking at data mining solutions for discovery of information.

Objectives

After completing this lesson, you should be able to do the following:

• Describe the importance of business intelligence

• Identify multidimensional query techniques

• Identify where data mining might be employed in a warehouse environment

• Identify data mining tools


What Is Business Intelligence?

“Business Intelligence is the process of transforming data into information and through discovery transforming that information into knowledge.”

Gartner Group

Business Intelligence

The purpose of business intelligence is to convert the volume of data into value for the end users.

[Figure: four stages (data, information, knowledge, decision) shown against volume and value]


Business Intelligence

Companies require business intelligence to direct business process improvement and to monitor time, cost, quality, and control.

Definition

Howard Dresner, an analyst with the Gartner Group, defines business intelligence as a process of turning data into information and, through iterative discoveries, turning that information into business intelligence. The key is that business intelligence is a process: cross-functional, in line with current management thinking, and not presented in IT terms.

Purpose of Business Intelligence

The purpose of business intelligence is to convert large volumes of data into information, and to link pieces of information together within a decision context so that they become knowledge that can be used to aid decision making.

This can be accomplished through the use of data access tools and techniques that use organized collections of data, systems, and applications by which organizations gather and interpret relevant information about the business and turn it into highly quantifiable plans, policies, procedures, and metrics.

The value chain begins with the data resource. Data is defined as facts and figures.

Information is data processed and interpreted into a meaningful framework. It is a set of data in context that is relevant to one or more people at a point in time or for a period of time.

Knowledge refers to the meaning and understanding that result from users processing information. For knowledge to be useful in the decision-making process, there must be a high-quality integrated resource, high-quality information preparation and sharing, and a high-quality human resource to discover and accumulate knowledge to achieve successful business intelligence.


Multidimensional Query Techniques

[Figure: slicing, dicing, and drilling down on a data cube with Product, Time, and Geography dimensions, annotated with the questions “What?” and “Why?”]

Multidimensional Query Techniques

[Figure: drilling up, drilling across, and pivoting, annotated with the questions “What?” and “Why?”]


Multidimensional Query Techniques

These techniques are standard in modern query tools that present data in a multidimensional manner. The following are some of the common multidimensional query techniques.

Slicing

Slicing means limiting the view of the data to a selection, such as a particular consultant, region, or cost center. An example of a slice is a view of the data for a regional manager across all products and time periods.

Dicing

Dicing is slicing in multiple directions: you make the selection along more than one dimension. In dicing, you can refine the selection by adding or removing more of the data cube.

Drilling

Drilling is being able to open up a subset of data that corresponds to a particular value of a dimension. The term describes the action of moving down to further levels of data detail or up to higher levels of summary data.

Drilling Down

Drilling down is a mechanism that enables the user to examine the detail for a summary value. For example, the user may examine where rackets were sold, to what companies, and how many items any individual purchased.

Drilling Up

Drilling up is the ability to query detail records and navigate up to higher-level summary records.

Drilling Across

Drilling across is the ability to query from one fact table to another in a single report.

Pivoting

Pivoting data is changing the axes along which you orient your data. It also refers to the ability to change the organization of rows and columns in a tabular report. This enables the user to view the data along different dimensions without requerying the database itself.

OLAP has other associated query techniques, some of which are vendor dependent. For example, top/bottom analysis selects the top or bottom ranges of data based on criteria to perform exception reporting.
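
The techniques above can be illustrated on a tiny in-memory fact table. This is a sketch only: the rows, the helper names select and total_by, and the dimension layout are hypothetical, and a real query tool would issue these operations against the warehouse rather than a Python list. Drilling across is not shown because it needs a second fact table.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, month, units sold).
FACTS = [
    ("Racket", "East", "Q1", "Jan", 10),
    ("Racket", "East", "Q1", "Feb", 5),
    ("Racket", "West", "Q1", "Jan", 7),
    ("Ball", "East", "Q1", "Jan", 20),
    ("Ball", "West", "Q2", "Apr", 8),
]
PRODUCT, REGION, QUARTER, MONTH, UNITS = range(5)
POSITIONS = {"product": PRODUCT, "region": REGION, "quarter": QUARTER, "month": MONTH}


def select(facts, **fixed):
    """Keep rows matching every fixed dimension value (one = slice, several = dice)."""
    return [
        row for row in facts
        if all(row[POSITIONS[name]] == value for name, value in fixed.items())
    ]


def total_by(facts, *dims):
    """Sum units sold, grouped by the given dimension positions."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row[UNITS]
    return dict(totals)


# Slicing: fix one dimension (region) and keep all products and periods.
east_slice = select(FACTS, region="East")

# Dicing: select along more than one dimension.
east_racket_dice = select(FACTS, region="East", product="Racket")

# Drilling up: the summary level (units per quarter).
by_quarter = total_by(FACTS, QUARTER)

# Drilling down: the monthly detail behind the Q1 summary value.
by_month_in_q1 = total_by(select(FACTS, quarter="Q1"), MONTH)

# Pivoting: swap the axes of a product-by-region report without requerying.
product_by_region = total_by(FACTS, PRODUCT, REGION)
region_by_product = {(r, p): units for (p, r), units in product_by_region.items()}

print(by_quarter)
print(by_month_in_q1)
```

Note that the pivot step reuses the totals already computed, which mirrors the point that pivoting changes the orientation of a report without going back to the database.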


Categories of Business Intelligence Tools

• Reporting tools

• Query tools (data access)

• Online analytical processing (OLAP) tools

• Analytical suites

• Data mining tools

• Analytical applications


Evolution of Reporting

Mainframe

• Batch oriented

• IS controlled

• 3GL-based

• Not user-specific

• Inflexible

• IS intensive

Client-Server

• End user empowered

• Reduced IS manageability

• Expensive

• Localized

Multitier enterprise reporting

• Easy to use

• Manageable

• Scalable

• Accessible


Categories of Business Intelligence Tools

According to Wayne Eckerson from the Data Warehousing Institute (Criteria for Evaluating Business Intelligence Tools, Journal of Data Warehousing, Volume 4, Number 1, Spring 1999), the categories of business intelligence tools are:

• Reporting tools

• Query tools (data access)

• Online analytical processing (OLAP) tools

• Analytical suites

• Data mining tools

• Analytical applications

Reporting Tools

These tools allow users to produce canned, graphic-intensive, sophisticated reports based on the warehouse data. The evolution of reporting is described below.

Mainframe

In the mainframe era, batch reporting generated large, cumbersome reports. These reports were constructed in time-consuming, difficult-to-use 3GL programming environments.

Client/Server

The advent of the PC brought rich graphical user interfaces, leading to the introduction of much more productive 4GL reporting tools. This, combined with the advent of client-server computing, began to deliver much more user-friendly and flexible reports.

Enterprise Reporting

We are now in the enterprise reporting era. This new reporting architecture delivers the combined benefits of mainframe and client-server environments.

Oracle Reports is an enterprise reporting tool for developers to build and disseminate sophisticated, high-quality reports. Users view reports dynamically generated by the application-server-reporting engine. Users can access reports from anywhere in the enterprise using a web browser.

Oracle Reports takes advantage of the scalability of the internet computing model. The powerful reports server helps you to easily deploy your applications in a multi-tier environment that uses an advanced caching technology to provide dynamic load balancing.


Oracle Discoverer 3.1

[Figure: Oracle Discoverer 3.1 components: Administration Edition, User Edition, and Viewer Edition, with the End User Layer and a transaction database or data warehouse/mart]

Discoverer for the Web

• View workbooks using a Web browser

• Business intelligence tool that provides information anywhere and at any time

• Cost-effective


Query Tools

These tools enable users to explore a data source using intuitive ad hoc queries. The tools provide the means for pulling the desired information from a database. They are typically SQL-based tools and allow a user to define data in end-user language.

Oracle Discoverer is Oracle’s award-winning ad hoc query, reporting, and analysis tool designed by end users for the end users. Oracle Discoverer for the Web makes it easy for any user to leverage information in data warehouses, data marts, and relational databases using a web browser. It features industry-leading ease of use and performance features such as query prediction and automatic summary management, which provide time and cost savings for the enterprise. The components of Oracle Discoverer 3.1 are described below.

Discoverer User Edition

As an end user, you use this component to perform ad hoc queries, generate reports, and publish information stored in the online dictionary.

Discoverer Administration Edition

Business and information technology (IT) data administrators use this component to create, maintain, and administer data and the users’ interaction with that data.

End-User Layer

This component, a server-based meta layer, hides the complexity of the underlying relational database so that you can interact with the online dictionary using ordinary business terms.

Discoverer Viewer Edition

As an end user, you use this component to view your data using a Web browser. Using the Discoverer Viewer, you can view the workbooks that you have created in the User Edition, through the Internet. You can access Discoverer Viewer with Internet Explorer 4.0 or Netscape 4.05 or higher. Discoverer Viewer takes advantage of existing Discoverer installations, thus providing easy access at any time to the workbooks stored in the database. Because of the consistent user interface between the User Edition and the Viewer Edition, users can easily work with their stored workbooks in Discoverer Viewer without any additional training.

The following features are available in Discoverer Viewer:

• View workbooks stored in the database

• Use drilling

• Refresh data

• Print reports

• Provide parameters to view specific data

• Customize the execution of queries
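
The End-User Layer idea described earlier in this topic, hiding table and column names behind ordinary business terms, can be sketched as a simple lookup. This is not Discoverer's actual End-User Layer format or API; the dictionary BUSINESS_TERMS and the function build_query are hypothetical, and the column names are borrowed from the PRODUCT example in Lesson 14.

```python
# Hypothetical mapping from business terms to (table, column).
BUSINESS_TERMS = {
    "Product": ("PRODUCT", "PRODID"),
    "Shipping Weight (kg)": ("PRODUCT", "WEIGHT"),
    "Warehouse Location": ("PRODUCT", "WARE_LOC"),
}


def build_query(terms):
    """Translate a list of business terms into a single-table SELECT statement."""
    tables = {BUSINESS_TERMS[term][0] for term in terms}
    if len(tables) != 1:
        raise ValueError("this sketch only handles terms from a single table")
    columns = ", ".join(f'{BUSINESS_TERMS[term][1]} AS "{term}"' for term in terms)
    return f"SELECT {columns} FROM {tables.pop()}"


query = build_query(["Product", "Shipping Weight (kg)"])
print(query)
```

The user picks "Product" and "Shipping Weight (kg)"; the physical names PRODID and WEIGHT appear only in the generated statement.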


Online Analytical Processing (OLAP)

[Figure: OLAP cube of sales by product, market, and time, showing a regional manager view, a product manager view, a financial manager view, and an ad hoc view]

Advanced Analytical Tasks

• Comparative and relative analysis

• Exception and trend analysis

• Time series analysis

• Forecasting

• What-if analysis

• Modeling

• Simultaneous equations


Online Analytical Reporting (OLAP) ToolsOLAP tools provide a multidimensional view of data, allowing users to easily navigate through multiple dimensions (such as customer, organization and time) and hierarchies within dimensions (such as year, quarter, and month).

The different types of tools in this category are multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and hybrid OLAP (HOLAP). They have been discussed in Lesson 6.

Oracle Express

Oracle Express provides sophisticated online analytical processing (OLAP) analysis through its advanced calculation engine and multidimensional data cache. The Express multidimensional data model is optimized for the query and analysis of corporate data, such as sales, marketing, financial, manufacturing, or human resource data.

Oracle Express provides a native multidimensional data model for optimal OLAP power and performance. The multidimensional model:

• Is specifically designed for analysis

• Inherently reflects the way users think about their businesses

• Ensures that end users can efficiently analyze data in a structured or ad hoc fashion, without requesting special programs from IS personnel

Through its built-in analytic functions, Oracle Express provides the answers to a range of complex analytic questions.

Oracle Express enables users to perform advanced analytical tasks, such as:

• Comparative and relative analysis

• Exception and trend analysis

• Modeling

• Forecasting

• Time-series analysis

• What-if analysis

It delivers powerful analytical capabilities to any Web browser, enabling sophisticated analysis over corporate intranets and the Internet.


Analytical Suites

• Enterprise business intelligence (EBI) toolsets:

– Web-enabled query, reporting, and analysis tool that runs on a robust application server

– EBI toolset tightly integrates query, reporting, and analysis capabilities within a single tool

– Shares a common look and feel

• Business portals:

– EBI toolset with a Yahoo!-like user interface

– Flexible repository handles structured and unstructured data objects

Data Warehousing Institute


Data Mining Tools

• Identify patterns and relationships in data that are often useful for building models that aid decision making or predict behavior

• Data mining uses technologies such as neural networks, rule induction, and clustering to discover relationships in data and make predictions that are hidden, not apparent, or too complex to be extracted using statistical techniques.


Analytical Suites

According to Wayne Eckerson from the Data Warehousing Institute, the tools in the analytical suites are as follows.

Enterprise Business Intelligence (EBI) Toolsets

An EBI toolset is a Web-enabled query, reporting, and analysis tool that runs on a robust application server instead of a desktop machine. An EBI toolset tightly integrates query, reporting, and analysis capabilities within the context of a single tool as opposed to a suite of tools. Each analytical “modality” shares a common look and feel and passes data seamlessly to each of the other modalities, as required. Web and client-server versions offer equivalent functionality.

Business Portals

A business portal is an EBI toolset with a Yahoo!-like user interface. This tool has a flexible repository that handles structured and unstructured data objects, and a publish or subscribe engine that delivers reports to users on a customizable basis.

(Criteria for Evaluating Business Intelligence Tools, Journal of Data Warehousing, pg. 29, Volume 4, Number 1, Spring 1999)

Data Mining Tools

Data mining tools identify patterns and relationships in the data that are often useful for building models that aid decision making or predict behavior. Data mining uses technologies such as neural networks, rule induction, and clustering to discover relationships in data and make predictions that are hidden, not apparent, or too complex to be extracted using statistical techniques.

Note: Data mining will be covered in the next section.


Analytical Applications

• A packaged analytical application has predefined:

– Extraction feeds and transformation routines for a specific data source

– Data model, application-specific report templates, and a custom end-user interface

• Custom analytic applications are workbenches that enable developers to quickly create analytic applications from coarse-grained components, including user interface widgets, data access and analysis components, and report layouts.

Data Warehousing Institute


Analytical Applications

According to Wayne Eckerson from the Data Warehousing Institute,

“Analytical applications incorporate business intelligence tools and a data warehouse or data mart to deliver analytical capabilities within a well-defined business process. An analytical application uses a custom interface to step users through a set of data collection and analysis tasks that lead up to a decision. The analytical application also provides the context for users to act on their business decisions, whether it involves emailing a document, updating a database, or initiating a workflow.”

(Criteria for Evaluating Business Intelligence Tools, Journal of Data Warehousing, pg. 29, Volume 4, Number 1, Spring 1999)

The tools in the analytical applications are described below.

Packaged Analytic Application

Packaged analytic applications come with predefined extraction feeds and transformation routines for a specific data source, a predefined data model, application-specific report templates, and a custom end-user interface.

Custom Analytic Application

Custom analytic applications are workbenches that enable developers to quickly create analytic applications from coarse-grained components, including user interface widgets, data access and analysis components, and report layouts.

(Wayne Eckerson, Criteria for Evaluating Business Intelligence Tools, Journal of Data Warehousing, pg. 29, Volume 4, Number 1, Spring 1999)


Definition of Data Mining

“Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns, trends, relationships, and rules.”

Data mining is also known as:

• Knowledge discovery

• Data surfing

• Data harvesting


Uses of Data Mining

• Customer profiling

• Market segmentation

• Buying pattern affinities

• Database marketing

• Credit scoring and risk analysis


Data Mining in a Warehouse Environment

Definition of Data Mining

Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns, trends, relationships, and rules. The purpose of data mining is to enable proactive business decisions. Data mining tools empower the user to search for patterns of information in data. Data mining is far less user-directed than conventional querying and relies on specialized algorithms, such as fuzzy logic, neural networks, genetic algorithms, and induction, that correlate information from the data warehouse and assist in trend analysis. Data mining also refers to a process rather than a technology; the goal of that process is to explore large amounts of data to discover new trends, relationships, and categories in that data. Data mining is also referred to as knowledge discovery, data surfing, or data harvesting.

Uses of Data Mining

Data mining has many applications:

• Store owners can use it to determine and market products according to user classification.

– Affinities

– Purchasing patterns

– Goods purchased (basket analysis)

• Business analysts can use it to determine patterns of product purchases.

– Fraud detection

– Profile buying patterns

– Determining high-risk and low-risk customers

• Credit card suppliers can use it to target an audience for a new card service, and financial institutions can use it for credit scoring and risk analysis.

Data mining techniques can be used by anyone who needs to:

• Develop strategies for marketing

• Target mail lists

• Adjust inventory levels

• Minimize operational and financial risks to the business

• Keep costs to a minimum

• Find out something new and never before considered


Functions of Data Mining

• Discovers facts and data relationships

• Finds patterns

• Determines rules

• Retains and reuses rules

• Presents information to users

• May take many hours

• Requires knowledgeable people to analyze the results


Functions of Data Mining

Discovery

Data mining queries discover facts and data relationships using techniques such as association, frequency of occurrence, and sequential patterns.

Rule Retention

Data mining techniques learn patterns and create rules to describe the patterns; the rules are retained for reuse against larger sets of data for further analysis.

Self-Motivating

Some data mining queries require little human intervention, but do need guidance. Certain data mining models, such as cluster analysis, do not require any guidance at all. On the whole, data mining tasks are a guided discovery of data; that is, you have a notion of what you are trying to find out, for example, information about debtors or selling patterns.

Expert Analysis

The results of a query, once presented, need knowledgeable people to analyze and use them correctly.


Comparing DSS and Data Mining Queries

• DSS queries:

– Based on prior knowledge and assumptions

– User-driven

• Data mining queries:

– Require domain-specific knowledge to interpret data

– User-guided


Comparing DSS and Data Mining Queries

Decision support queries are driven by a user who knows how to pose a question in order to achieve specific results. The user knows what the question is and requires the DSS application only to supply the answer. Therefore, the user applies known parameters to the query prior to execution, in order to achieve a result based on those known parameters.

Data mining queries differ in that the user provides only some initial guidance. They require users to have the domain-specific knowledge to interpret the data. Data mining can find answers to problems and information you have not considered before.


Artificial Neural Networks

• Predictive model that learns

• Developed from understanding of the human brain

• Multiple regression and other statistical techniques

Figure: a neural network with nodes numbered 1 to 8, arranged as inputs, a hidden layer, and outputs.


Decision Trees

• Represent decisions

• Generate rules

• Classify

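Figure: a decision tree with nodes labeled Annual salary, Annual outgoing, and Annual credit, branch values 100,000, > 50,000, and < 10,000, and outcomes labeled Good and Bad.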


Data Mining Techniques

Artificial Neural Networks

Neural networks are nonlinear predictive models that learn through training. They resemble biological neural networks in structure.

A neural network is a network of processors, each of which contains an amount of local memory. The processors are connected by communication channels carrying numeric data, encoded by various means. The processors operate only on their local data and on the inputs they receive through the communication channels. The field of neural networks arose from the development of artificial intelligence systems (among other technologies) capable of sophisticated computations similar to those performed by the human brain.

Many of the improvements in neural network technology have followed from a much improved understanding of how the human brain functions.

Most neural networks have a training rule whereby the weights of the connections are adjusted based on the data; that is, they learn from examples. Neural networks are employed by statisticians, engineers, and scientists, and by neurophysiologists to explore brain function.

Neural networks can be used for classification, clustering, modeling, determining sequences, and multiple regression and other statistical techniques.
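The idea of a training rule that adjusts weights from examples can be sketched with a single artificial neuron (a perceptron). This is an illustration only, not the method of any particular data mining tool; the function names, the learning rate, and the logical-AND training data are assumptions made for the example.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs exceeds zero, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(examples, epochs=20, rate=1.0):
    """Perceptron rule: nudge each weight in proportion to the error."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training set: the output is 1 only when both inputs are 1.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(examples)
learned = [predict(weights, bias, inputs) for inputs, _ in examples]
```

A single neuron can learn only linearly separable patterns such as this one; the hidden layer shown on the slide is what lets a network model nonlinear relationships.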

Decision Trees

These are tree-shaped structures that show a route taken by a certain decision, or a series of decisions. Each decision generates a rule to classify the data that it returns. A bank may use a decision tree to determine the worthiness of a customer requesting a loan (is the customer a good or a bad risk?). This is classification.

Some tools that support decision tree technology (rather than data mining technology) can display decision tree results graphically.
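The bank loan example can be sketched as a small tree of nested decisions. Note the assumptions: a data mining tool would derive the tree from historical data, whereas this one is written by hand, and the function name and every threshold are invented for illustration.

```python
def classify_applicant(annual_salary, annual_outgoing, annual_credit):
    """Walk a small hand-written decision tree and return 'good' or 'bad'.

    Each branch is one decision; the path from root to leaf reads as a rule.
    All thresholds are made up for illustration.
    """
    if annual_salary > 50_000:
        # Rule: high salary AND modest outgoings -> good risk.
        if annual_outgoing < 0.5 * annual_salary:
            return "good"
        return "bad"
    # Rule: lower salary AND little existing credit -> good risk.
    if annual_credit < 10_000:
        return "good"
    return "bad"
```

For example, `classify_applicant(100_000, 30_000, 0)` follows the high-salary branch and returns `"good"`, while `classify_applicant(30_000, 10_000, 20_000)` follows the lower-salary branch and returns `"bad"`.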


Other Techniques

• Genetic algorithms based on evolution theory

• Statistics such as averages and totals

• Nearest neighbor to find associations

• Rule induction applying IF-THEN logic

• Experiment with different techniques


Data Mining Techniques (continued)

Genetic Algorithms

These are essentially optimization techniques using processes such as natural selection and genetic combination. The design is based on the concepts of evolution (Darwin’s theory of the survival of the fittest) and mutation theories.
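A minimal sketch of selection, genetic combination, and mutation follows. It optimizes a toy objective (the number of 1 bits in a string) rather than a real business problem; the function names, population size, mutation rate, and seed are all assumptions chosen for the example.

```python
import random


def fitness(bits):
    """Toy objective: the number of 1 bits in the candidate."""
    return sum(bits)


def evolve(pop_size=20, length=16, generations=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    initial_best = max(map(fitness, population))
    for _ in range(generations):
        # Selection: keep the fitter half (survival of the fittest).
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # genetic combination point
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return initial_best, best
```

Because the surviving parents are carried over unchanged, the best fitness in the population can never decrease from one generation to the next.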

Statistics and Quantitative Analysis

Data mining uses statistics based on linear models that may be quite complex, such as averages, distributions, ranking, regression, clustering, and other statistical techniques. There is an overlap between the fields of neural networks and statistics.

Nearest Neighbor

This technique is used for finding associated records or clusters of records. It classifies each record in a selected set of data, based on a combination of the classes of the K records most similar to it, where K is greater than or equal to one.
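The classification step can be sketched as a majority vote among the K most similar records. In this sketch, similarity is assumed to be Euclidean distance between numeric attributes; the `knn_classify` function and the sample records are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_classify(records, query, k=3):
    """Label `query` with the majority class of its k nearest records."""
    nearest = sorted(records, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical records: ((age, income in thousands), class).
records = [
    ((25, 30), "low spender"),
    ((27, 35), "low spender"),
    ((30, 32), "low spender"),
    ((45, 90), "high spender"),
    ((50, 95), "high spender"),
    ((48, 85), "high spender"),
]
```

In practice the attributes would usually be scaled first, so that one attribute with a large numeric range does not dominate the distance.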

Rule Induction

Data mining can extract useful IF-THEN rules based on the statistical significance of the data. Rule induction allows you to find data associations and sequences, and employs decision tree techniques for prediction and analysis.

No single mining technique can be recommended in isolation. The data to be analyzed varies between businesses; the hypotheses tested are diverse. You should consider employing as many techniques as the tool allows; you must experiment.

Note: There are many other techniques used in data mining. This is just a sample selection.


Associations

Which items are purchased in a retail store at the same time?


Sequential Patterns

What is the likelihood that a customer will buy a product next month, if he buys a related item today?


Typical Data Mining Results

Associations

Data mining can discover associations between items, that is, how items relate to each other. It answers questions such as, “Which items are purchased in a retail store at the same time?” For example, shirts and ties, eyeliner and mascara, or cameras and televisions. However, this result does not determine the rationale behind the association.
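One common way to quantify such associations is to count how often pairs of items occur together (support) and how often one item accompanies another (confidence). The sketch below assumes those two measures; the baskets and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, one set of items per sales transaction.
baskets = [
    {"shirt", "tie"},
    {"shirt", "tie", "socks"},
    {"shirt", "socks"},
    {"eyeliner", "mascara"},
    {"shirt", "tie", "mascara"},
]


def pair_counts(transactions):
    """Count how many transactions contain each pair of items."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return counts


def confidence(transactions, if_item, then_item):
    """Share of transactions containing if_item that also contain then_item."""
    with_if = [t for t in transactions if if_item in t]
    both = [t for t in with_if if then_item in t]
    return len(both) / len(with_if)


counts = pair_counts(baskets)
support_shirt_tie = counts[("shirt", "tie")] / len(baskets)
```

With this data, shirts and ties appear together in three of the five baskets, and a tie is present in three of the four baskets that contain a shirt. As the text notes, the counts show that the association exists, not why.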

Sequential Patterns

Data mining can describe associations over some period of time. It can answer questions such as, “What is the likelihood that a customer will buy a product in the future, if he buys a related item today?” For example, personal computer today, printer next month; or a set of tools today and the toolbox to put them in tomorrow.

Patterns involving time emerge. For example, if a customer buys a set of tools today, there may be a pattern that shows the percentage likelihood of the toolbox being purchased tomorrow, within one week, or within two weeks. This is a good way for a retail store to determine a marketing campaign. Classification results enable the store to target the correct customer at the same time.
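The tool set and toolbox example can be sketched as a simple calculation: of the customers who bought the first item, what fraction bought the second within a given number of days? The purchase history, dates, and function name below are hypothetical.

```python
from datetime import date

# Hypothetical purchase history: customer -> list of (date, item).
purchases = {
    "c1": [(date(1999, 3, 1), "tool set"), (date(1999, 3, 2), "toolbox")],
    "c2": [(date(1999, 3, 1), "tool set"), (date(1999, 3, 6), "toolbox")],
    "c3": [(date(1999, 3, 3), "tool set")],
    "c4": [(date(1999, 3, 4), "tool set"), (date(1999, 3, 16), "toolbox")],
}


def follow_up_rate(history, first, then, within_days):
    """Fraction of buyers of `first` who bought `then` within the window."""
    buyers = followers = 0
    for events in history.values():
        first_dates = [d for d, item in events if item == first]
        if not first_dates:
            continue
        buyers += 1
        start = min(first_dates)
        if any(item == then and 0 < (d - start).days <= within_days
               for d, item in events):
            followers += 1
    return followers / buyers if buyers else 0.0


tomorrow = follow_up_rate(purchases, "tool set", "toolbox", 1)
one_week = follow_up_rate(purchases, "tool set", "toolbox", 7)
two_weeks = follow_up_rate(purchases, "tool set", "toolbox", 14)
```

The three results correspond to the "tomorrow, within one week, or within two weeks" likelihoods described above, computed here as plain observed frequencies.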


Classifications

Determine customers’ buying patterns, and then find other customers with similar attributes that may be targeted for a marketing campaign.


Modeling

Use factors, such as location, number of bedrooms, and square footage, to determine the market value of a property


Typical Data Mining Results (continued)

Classification

Data mining can divide items into groups.

For example, you can determine customers’ buying patterns, and then find other customers with similar attributes who may be targeted for a marketing campaign, such as credit card users with balances within 10% of their maximum credit limit, or people employed in the construction industry.

Modeling

Data mining can map a set of input values to a single output value.

For example, you may use factors such as location, number of bedrooms, and square footage to determine the market value of a property.
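As a minimal sketch of mapping inputs to a single output value, the following fits a straight line by ordinary least squares. For brevity it assumes only one factor (square footage) rather than the several named above, and the sales figures and function name are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical past sales: square footage and sale price.
square_feet = [1000, 1500, 2000, 2500]
prices = [150_000, 200_000, 250_000, 300_000]

intercept, slope = fit_line(square_feet, prices)
estimate = intercept + slope * 1800  # estimated value of an 1,800 sq ft home
```

Adding further factors such as location or number of bedrooms would turn this into a multiple regression, which needs one coefficient per factor instead of a single slope.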


Oracle Data Mining Partners

• Angoss International, Ltd.

• DataMind Corp.

• Datasage, Inc.

• Information Discovery, Inc.

• SPSS Inc.

• SRA International, Inc.

• Thinking Machines Corp.


Oracle Data Mining Partners

WTI Partner and Product

Angoss International, Ltd.
KnowledgeSeeker IV is a data mining software tool that uses a unique cross-referencing process to enable businesses to analyze varied and disparate databases.

DataMind Corp.
DataMind DataCruncher provides fast, accurate data mining capabilities for making sense of corporate data.

Datasage, Inc.
DataSage Mining Manager provides a robust infrastructure to develop, deploy, and manage enterprise data mining applications, ensuring a complete solution that will increase corporate profitability and reduce the time to ROI for data mining projects.

Information Discovery, Inc.
Data Mining Suite is an integrated set of products providing powerful, complete, and comprehensive solutions for large-scale enterprisewide decision support and data mining.
Rapid Pilot Data Mining is designed for Fortune 2000 companies wanting to accelerate the data-mining introduction process and quickly gain notable results.
Knowledge Access Suite has delivered the first and only set of products ever to provide business users with a gateway to knowledge predistilled from raw data and stored in a pattern base.

SPSS Inc.
SPSS is an open, best-of-breed data mining solution that delivers each of the four A’s of data mining: access, analysis, action, and automation.

SRA International, Inc.
KDD Explorer is an easy-to-use data mining toolset that assists business analysts in the discovery and analysis of novel patterns in terabyte-sized databases.

Thinking Machines Corp.
LoyaltyStream is a complete solution that includes specific applications, software, user training, and expert consulting services for understanding customer behavior, building mining marts, building predictive models, and deploying models throughout an enterprise.


Summary

This lesson covered the following topics:

• Describing the importance of business intelligence

• Identifying where data mining might be employed in a warehouse environment

• Identifying data mining tools


Practice 15-1 Overview

This practice covers the following topics:

• Identifying the type of analysis based on a description of a scenario

• Matching the category of information with a list of descriptions

• Identifying data mining techniques


Practice 15-1

1 In the following scenarios, choose the type of analysis that most accurately defines the scenario. The types of analysis from which you may choose are:

– Query and reporting

– Multidimensional/OLAP

– Data mining

– Drill-down and pivot

– Calculations and derived data

– Spreadsheet

– Modeling, time-series and financial

– What if

Scenario (supply the Type of Analysis for each)

a. Show start date and salary grade for all employees reporting to Clare Maury

b. Highlight all orders above $30,000.00

• Drill from product totals to individual orders

• Look at a copy of the invoice

c. Show product sales in each region as a percentage of the total sales in that region.

d. Did the $2 million promotion increase sales?

e. How many people to hire, when to hire them, and where to locate them.

f. If we lowered prices, would our overall revenue increase?

g. Find me the relationship between X and Y.

h. Show me all the products that are currently back-ordered.

i. What is the 13 week moving average of sales?

j. Projecting costs and allocating overhead based on head count, sales forecasts, and consumer price index (CPI).

2 For the following phrases and sentences, determine which category each of them belongs to. You may choose from the following list.

• Data

• Information

• Knowledge

• Decision

Description (supply the Category for each)

Mary lives in Belmont Shores, California.

Point of sale (POS)

AppleTree juice is bought 45% of the time that Crystal Geyser juice is bought.

Let us promote Crystal Geyser juice on the East Coast of the United States in stores.

Demographic

Customers of the upper middle class will use 10% of their annual income during the Christmas holiday season.

3 The diagram below illustrates an example of data mining. The technique that it uses is called _________________.

Figure: a diagram with nodes labeled Age, Region, Call Rate, and Service, and outcomes labeled Lost and Loyal.

4 The description below describes a data mining technique. What is the technique used?

1. If the vehicle has a 2-door frame AND

2. If the vehicle has at least six cylinders AND

3. If the buyer is less than 40 years old AND

4. If the cost of the vehicle is > $35,000 AND

5. If the vehicle color is red, THEN

6. The buyer is likely to be male.

Lesson 16: Web-Enabling the Warehouse

Overview

Figure: course road map with the following blocks: Project Management (Methodology, Maintaining Metadata); Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Choosing a Computing Architecture; Modeling the Data Warehouse; Planning Warehouse Storage; ETT (Building the Warehouse); Meeting a Business Need; Managing the Data Warehouse; Supporting End User Access (highlighted).


Overview

The previous lesson covered supporting end-user access. This lesson discusses Web-enabling the warehouse, which is another aspect of supporting end-user access to the warehouse. Note that the “Supporting End User Access” block is highlighted in the course road map on the facing page.

Specifically, this lesson discusses how to take advantage of the Web to deploy data warehouse information. It addresses internal and external access, as well as the advantages of Web-enabling a data warehouse. The lesson outlines the steps involved in deploying a Web-enabled data warehouse. Challenges in deploying a Web-enabled data warehouse are also discussed.

Objectives

After completing this lesson, you should be able to do the following:

• Explain how the Web can expand data warehouse usage

• Describe the issues involved in putting a data warehouse on the Web

• Outline the requirements for evaluating Web-based query and analysis tools


Benefits of Web-Enabling a Data Warehouse

• Better-informed decision making

• Lower costs of deployment and management

• Lower training costs

• Remote access

• Enhanced customer service and improved image as a technology leader

• Greater collaboration among users


Accessing the Warehouse Over the Web

A Web-enabled data warehouse lets users access and query the warehouse through a standard Web browser. Users can perform ad hoc queries against the database using the Web browser of their choice.

The primary purpose of Web-enabling a data warehouse is to give remote offices and mobile professionals the information they need to make tactical business decisions. Companies are increasingly aware that the Internet can help them reach out to new markets and increase their value to customers, particularly by offering individualized, one-to-one marketing.

Benefits of Web-Enabling a Data Warehouse

Deploying data warehouse applications on the Web is becoming increasingly popular. The benefits of a Web-enabled data warehouse are:

• Better-informed decision making: Users with access to more comprehensive information and analyses can make better decisions, with the results directly affecting the organization’s bottom line.

• Lower costs of deployment and management: A Web server serves many browser clients from a single location, reducing the number of installations and upgrades needed, and reducing the cost of support.

• Lower training costs: After a user is trained in the use of a Web browser, the user is equipped to access and use most of the resources on the corporate intranet.

• Improved return on investment (ROI): Increasing the use of the data warehouse spreads its value among more users and shortens the time needed to achieve ROI.

• Remote access: The ability to put information to use out of the office is greatly expanded, because through the Web, users can access the information anytime and anywhere.

• Enhanced customer service and improved image as a technology leader: Up-to-date information can be made available immediately to a wide range of users, allowing them to help themselves and get an immediate response to their questions.

• Greater collaboration among users: Users can share information and analysis across organizations.


Challenges of Web-Enabling a Data Warehouse

• Security

• Business value

• Impact assessment

• Setup and management

• Tools and support for global requirements


Challenges of Web-Enabling a Data Warehouse

According to the Hurwitz Group, putting a data warehouse on the Web offers tremendous benefits, but it also presents some technical and organizational challenges.

• Security: The loss of data warehouse data to hostile parties can have extremely serious legal, financial, and competitive impacts on an organization. Make sure that your solution has strong encryption, authorization, and authentication services.

• Business value: In order to succeed in Web-enabling your data warehouse, you need to have a warehouse sponsor who will help to develop a clear business case for putting the warehouse on the Web. Some of the questions to answer include:

– What are users going to do with the Web-enabled data warehouse?

– Who will you allow to access the Web-enabled data warehouse?

– What will users be allowed to use the Web-enabled data warehouse for?

– How will this affect other departments, such as order processing, sales, indirect channels and other business partners, and customer support?

• Impact assessment: You need to assess the impact a Web-enabled warehouse will have on your IT organization and infrastructure. This includes:

– Changes in utilization patterns and the number of active clients

– The need to learn new skills, such as integrating a warehouse database with a Web server

– Other areas of consideration: Networks, servers, failover and recovery procedures, development and testing tools, and training programmers as well as operators

• Setup and management: You need to consider how people will use the warehouse and what impact their behavior will have on performance, availability, throughput, and network bandwidth. You need to select among three basic query approaches:

– Static pages

– Dynamic pages

– Dynamic queries

• Tools and support for global requirements: Because putting your warehouse on the Web increases its load and strains its capacity, you will need good tools for managing the system, especially the network and various servers. You must ensure that your vendors’ support services will meet your global support requirements.

(Source: Robert Craig, Data Warehousing and the Web. Hurwitz Group. September/October, 1997)


Common Web Data Warehouse Architecture

[Top slide diagram, components: client browser, HTML, Web server, Common Gateway Interface, gateway program, warehouse database]

[Bottom slide diagram, components: Windows clients, World Wide Web client (client browser), Web server, OLAP server, warehouse server. Mechanisms listed: Common Gateway Interface (CGI), Object Request Broker Cartridge, Servlets, Netscape Server API (NSAPI), Internet Server API (ISAPI)]


Common Web Data Warehouse Architecture

The warehouse may be accessed through a browser using a standard gateway interface. The requestor accesses the Web server, using the Uniform Resource Locator (URL) address. The protocol between the requestor and the server is Hypertext Transfer Protocol (HTTP). The text document that travels between the two parties (requestor and Web server) is written using Hypertext Markup Language (HTML).

Warehouses are concerned with real data, not text documents. The Common Gateway Interface (CGI) facility of the Web server software provides a way of executing server-resident software, such as a program that issues a SELECT statement against a relational database.

Building secure applications for the Internet requires a well-thought-out security strategy as well as the appropriate application architecture. Most Web applications provide all users with the same access permissions. The information available is either not confidential or of a low level of confidentiality. The same security issue currently exists at the database level.
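To make the gateway mechanism described above more concrete, here is a minimal sketch in Python of what a gateway program might do: receive the query string passed on by the Web server, run a SELECT against the database, and return an HTML document. The table name sales_summary, its columns, and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse database; a real deployment would differ.

```python
import html
import sqlite3
from urllib.parse import parse_qs


def build_demo_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory table that stands in for the warehouse (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_summary (region TEXT, year INTEGER, revenue REAL)")
    conn.executemany(
        "INSERT INTO sales_summary VALUES (?, ?, ?)",
        [("East", 1998, 120.5), ("West", 1998, 98.0), ("East", 1997, 101.25)],
    )
    return conn


def gateway(query_string: str, conn: sqlite3.Connection) -> str:
    """Play the role of the gateway program: query string in, HTML response out."""
    params = parse_qs(query_string)
    region = params.get("region", [""])[0]
    # A parameterized SELECT keeps user input out of the SQL text.
    rows = conn.execute(
        "SELECT year, revenue FROM sales_summary WHERE region = ? ORDER BY year",
        (region,),
    ).fetchall()
    body = "".join(
        f"<tr><td>{year}</td><td>{revenue:.2f}</td></tr>" for year, revenue in rows
    )
    page = (
        f"<html><body><h1>Revenue for {html.escape(region)}</h1>"
        f"<table><tr><th>Year</th><th>Revenue</th></tr>{body}</table></body></html>"
    )
    # A CGI program writes a header block, a blank line, and then the document.
    return "Content-Type: text/html\r\n\r\n" + page


# Under a real Web server, the query string would typically come from
# the QUERY_STRING environment variable set by the server.
response = gateway("region=East", build_demo_warehouse())
```

The browser only ever sees the HTML in the response; the SELECT statement and the database connection stay on the server side.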

Note: As shown in the bottom slide for Common Web Data Warehouse Architecture, the communication mechanism between the OLAP server and the Web server can be any one of the following:

• Common Gateway Interface (CGI)

• Object Request Broker Cartridge

• Servlets

• Netscape Server API (NSAPI)

• Internet Server API (ISAPI)

• Other compatible mechanisms


Issues in Deploying a Data Warehouse on the Web

• Security:

– Authentication and authorization

– Communication confidentiality

– Access and restriction management

• Scalability

• Availability


Security

Authentication and authorization:

– Password

– Digital certificates

– Authentication tokens

Communication confidentiality

Access and restriction management


Issues in Deploying a Data Warehouse on the Web

Security

The Computer Emergency Response Team (CERT), an Internet security watchdog organization, reports that the number of security incidents reported to the center has grown dramatically, from fewer than 100 in 1988 to almost 2,500 in 1995.

The leakage of data warehouse information through unauthorized access by hostile parties can have extremely serious legal, financial, and competitive impacts on an organization. The risk is high because the warehouse holds processed information such as summarized data, trend analysis, and confidential reports used to make business decisions. Such leakage may also go undetected. Security is thus of utmost importance to the data warehouse manager.

To address the security needs, the data warehouse manager needs to pay attention to authentication and authorization, communication confidentiality, and access and restriction management.

Authentication and Authorization

According to CERT:

“Authentication is proving that a user is who he or she claims to be. That proof may involve something the user knows (such as a password), something the user has (such as a smart card), or something about the user that proves the person’s identity (such as a fingerprint).

Authorization is the act of determining whether a particular user (or computer system) has the right to carry out a certain activity, such as reading a file or running a program. Authentication and authorization go hand in hand. Users must be authenticated before carrying out the activity they are authorized to perform.”

(CERT, Security of the Internet (Web version). February 1998.)
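The distinction CERT draws between the two steps can be illustrated with a short sketch. The user store, role names, and permitted actions below are hypothetical, and the password-hashing parameters are illustrative assumptions rather than a recommendation from the source. The point is only the ordering: the user is authenticated first, and the authorization check is made afterward.

```python
import hashlib
import hmac
import secrets


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a salted hash so the stored value is not the password itself."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


# Hypothetical user store and role-to-permission mapping (assumptions).
_salt = secrets.token_bytes(16)
USERS = {
    "mary": {"salt": _salt, "hash": hash_password("s3cret", _salt), "roles": {"analyst"}},
}
PERMISSIONS = {
    "analyst": {"read_report"},
    "admin": {"read_report", "refresh_warehouse"},
}


def authenticate(username: str, password: str) -> bool:
    """Authentication: is the user who he or she claims to be?"""
    user = USERS.get(username)
    if user is None:
        return False
    candidate = hash_password(password, user["salt"])
    return hmac.compare_digest(candidate, user["hash"])


def authorize(username: str, action: str) -> bool:
    """Authorization: does this user have the right to carry out the action?"""
    roles = USERS.get(username, {}).get("roles", set())
    return any(action in PERMISSIONS.get(role, set()) for role in roles)


def perform(username: str, password: str, action: str) -> str:
    """Authenticate first, then check authorization for the requested action."""
    if not authenticate(username, password):
        return "denied: authentication failed"
    if not authorize(username, action):
        return "denied: not authorized"
    return f"ok: {action}"
```

For example, perform("mary", "s3cret", "read_report") succeeds, while the same user asking for "refresh_warehouse" is authenticated but not authorized.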

There are three means for a user to authenticate himself or herself:

• Something the user knows, such as a PIN or reusable password

• Something the user has, such as a smart card

• Something specific to the user, such as his or her palm print or voice

The three most widely used authentication mechanisms are:

• Password: A password is a string of characters and is the most basic security measure. Unfortunately, the same password is often used to access different systems and can be captured or stolen. It is better to use one-time passwords.

• Digital certificates: An electronic certificate that identifies users to ensure the successful and authorized transfer of information. The certificate identifies its owner to someone who needs proof of the bearer’s identity.

• Authentication tokens: These are small one-time password calculators with a display and sometimes a keypad. Some examples of authentication tokens are smart cards, thumbprint biometric scanning, and retinal pattern biometric scanning.

More advanced security technologies employ at least two of these three factors of user authentication and identification. For example, factor one is a memorized personal identification number (PIN); factor two is a smart card that displays a code generated at a programmed interval. The two factors combine to produce a one-time password.
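The two-factor scheme just described can be sketched as follows. The source does not name an algorithm for the token's displayed code; this sketch assumes the HMAC-based construction standardized in RFC 4226 and RFC 6238, with a 30-second interval. The function names and the simple "PIN followed by code" combination are illustrative assumptions.

```python
import hashlib
import hmac
import struct


def token_code(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Code shown by the token: changes once per time step (RFC 6238 style)."""
    counter = unix_time // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, as in RFC 4226
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def verify_passcode(passcode: str, expected_pin: str, secret: bytes, unix_time: int) -> bool:
    """Accept only the memorized PIN (factor one) followed by the current code (factor two)."""
    expected = expected_pin + token_code(secret, unix_time)
    return hmac.compare_digest(passcode.encode("utf-8"), expected.encode("utf-8"))
```

Because the code depends on the current time step, a captured passcode stops working once the interval has passed, which is what makes it a one-time password.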

Communication Confidentiality

Ensure that third parties cannot eavesdrop on communications or impersonate communicating parties. Data that is traversing the Internet should not be readable by unauthorized parties. Encryption, which is the transformation of data into a form unreadable to anyone without a suitable decryption key, is often used to protect data confidentiality.

The two most widely used types of encryption are symmetric key encryption and public key encryption.

In symmetric encryption, the same key is used to encrypt and decrypt the message. Therefore both the sender and receiver must somehow acquire the key before confidential communication can proceed. This distribution of the key is a point of vulnerability, and if improperly done, the communication can be compromised.
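As a toy illustration of the symmetric case, the sketch below uses one shared key for both directions. The cipher shown (XOR with a keystream derived from SHA-256) is an assumption chosen only because it fits in a few lines of standard-library Python; it is not a vetted algorithm and should not be used to protect real data.

```python
import hashlib
import secrets


def keystream(key: bytes, length: int) -> bytes:
    """Expand the shared key into `length` pseudo-random bytes (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:length])


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: the same key and the same operation work in both directions."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


# Both sender and receiver must already hold shared_key before they can communicate.
shared_key = secrets.token_bytes(32)
message = b"Q3 revenue summary"
ciphertext = xor_cipher(shared_key, message)
recovered = xor_cipher(shared_key, ciphertext)
```

Note that nothing in the code solves the problem the paragraph above points out: shared_key still has to reach both parties by some secure route.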

With public key encryption, one key is used to encrypt and a second different key is used to decrypt. The first key cannot decrypt the message and can be sent from the recipient to the sender or even made public. The sender uses this key to encrypt the message for the recipient. This ensures confidentiality in communication but not authentication of the sender.
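The public key case can be illustrated with textbook RSA on deliberately tiny numbers. The source does not name an algorithm, so RSA is an assumption here, and the primes, exponent, and function names are illustrative. Real systems use keys of thousands of bits together with padding schemes, so this is a sketch of the idea only.

```python
# Toy key generation with very small primes (illustrative values only).
P, Q = 61, 53
N = P * Q                    # modulus shared by both keys
PHI = (P - 1) * (Q - 1)
E = 17                       # public exponent
D = pow(E, -1, PHI)          # private exponent (requires Python 3.8+)

PUBLIC_KEY = (E, N)          # can be sent to the sender or even published
PRIVATE_KEY = (D, N)         # kept secret by the recipient


def encrypt(message: int, public_key: tuple) -> int:
    """Sender side: encrypt a small integer with the recipient's public key."""
    e, n = public_key
    return pow(message, e, n)


def decrypt(ciphertext: int, private_key: tuple) -> int:
    """Recipient side: only the private key recovers the message."""
    d, n = private_key
    return pow(ciphertext, d, n)


ciphertext = encrypt(65, PUBLIC_KEY)
recovered = decrypt(ciphertext, PRIVATE_KEY)
```

Anyone holding PUBLIC_KEY can produce a ciphertext, which is why this arrangement gives confidentiality but, as the text notes, does not by itself authenticate the sender.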

To provide both authentication and communication confidentiality, you can use digital certificates based on public key encryption. A trusted third party authenticates both parties by some reliable method and issues them digital certificates.

Access and Restriction Management

There should be some way to determine across the enterprise whether a particular party has certain privileges or access to valuable resources. When access and restriction management is not controlled in a unified manner, there is a possibility that certain parties may still have authorized access even though that is not desired. A directory server is often used as a single point of access and a single point of authentication. Other access management tools are routers and firewalls. Routers can be configured to restrict the flow of network packets to selected portions of the network based on message origin and destination.


Firewalls restrict the flow of traffic from one network to another based on protocols. Firewalls often include the capabilities of routers. In addition, a firewall can include the capabilities of a proxy server and make requests to external computers on behalf of internal network computers. This hides the configuration of the internal network, such as the names, IP addresses, and operating systems of internal computers, from outside parties.
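The kind of filtering attributed to routers and firewalls above (by origin, destination, and protocol) can be sketched as a first-match rule table. The addresses, rule set, and names below are hypothetical; actual devices have their own configuration languages and many more matching criteria.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str        # network the packet comes from (CIDR notation)
    destination: str   # network the packet is going to (CIDR notation)
    protocol: str      # "tcp", "udp", or "any"
    allow: bool


# Hypothetical policy: internal users may reach the warehouse subnet over TCP;
# all other traffic to that subnet is blocked.
RULES = [
    Rule("10.0.0.0/8", "10.1.5.0/24", "tcp", True),
    Rule("0.0.0.0/0", "10.1.5.0/24", "any", False),
]


def is_allowed(src: str, dst: str, protocol: str, rules=RULES) -> bool:
    """Return the decision of the first matching rule; deny if nothing matches."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for rule in rules:
        if (
            src_ip in ipaddress.ip_network(rule.source)
            and dst_ip in ipaddress.ip_network(rule.destination)
            and rule.protocol in ("any", protocol)
        ):
            return rule.allow
    return False
```

Denying by default when no rule matches is a design choice in this sketch: traffic has to be explicitly permitted before it reaches the warehouse subnet.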

Scalability

When enterprises that serve a large population offer service over the Internet, they face unpredictable demands. In particular, they may have to handle peak and special demand loads. With many business and government organizations, there are potentially thousands, if not millions, of online users. Data warehouse demands also tend to grow rapidly over time. Web-based access to a data warehouse might need a high order of scalability as well. This means that the system has to be parsimonious in the use of computing resources per user and should be incrementally extensible through the addition of computing resources.

The main concerns for data warehouse scalability over the Web are:

• The amount of data

• The complexity of queries

• The number of areas

• The number of users

The amount of data that is stored in a data warehouse is substantially greater than for most operational databases and continues to grow with time. Anthem’s data warehouse, for example, began with 1.3 TB of data and was anticipated to grow by a factor of 10 in three years. Because users are looking for trends and comparing data, it is typical for large amounts of data to be sent to the user per request.
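To get a feel for what tenfold growth in three years implies for capacity planning, the following sketch works through the arithmetic. It assumes steady compound growth and that the final size is ten times the starting size; the source gives only the starting size and the overall factor, so the intermediate yearly figures are illustrative.

```python
def annual_growth_factor(total_factor: float, years: int) -> float:
    """Yearly multiplier that compounds to `total_factor` over `years` years."""
    return total_factor ** (1 / years)


def project_sizes(start_tb: float, total_factor: float, years: int) -> list:
    """Projected size in TB at the end of each year, starting with year 0."""
    growth = annual_growth_factor(total_factor, years)
    return [start_tb * growth ** year for year in range(years + 1)]


# 1.3 TB growing tenfold over three years.
sizes = project_sizes(1.3, 10, 3)
yearly = annual_growth_factor(10, 3)
```

Under these assumptions the warehouse slightly more than doubles every year (a yearly factor of about 2.15), ending at 13 TB.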

The potential bottlenecks are in:

• Storage capacity

• Memory

• Computational cycles

• Limits on operating system resources such as file handles, ports, and locks

• Network bandwidth

Scalability issues should be considered from the beginning to handle both current needs and future growth. It may be difficult or impossible to make a nonscalable system scalable after implementation. At the same time, it is more cost-effective to add resources incrementally, as growth occurs, than to add them all at once.


Availability

• The Internet extends the reach of database applications throughout the enterprise, organizations, and communities.

• More and more data warehouses require 24 x 7 availability.

• Maintenance windows for batch extract, process, and refresh information for the data warehouse are shrinking.


Availability

The Internet extends the reach of database applications throughout the enterprise, organizations, and communities. This reach further highlights the importance of high availability in data management solutions. Small businesses and global enterprises alike have customers all over the world requiring access to data 24 hours per day and 7 days a week. This is true of many large operational systems but is also becoming the case for data warehouses. One consequence is that maintenance windows are shrinking or disappearing. Second, a failure in one part of the system should not make the entire system unavailable.

Maintenance windows are typically used to batch extract, process, and refresh information for the data warehouse. In the future, it will become important to be able to perform such maintenance operations on the data warehouse while it is online. This covers everything from adding disk packs, computers, and data files, to cleaning and refreshing the data from the operational system, to performing backup, archiving, and recovery operations.


Evaluating Web-Based Tools

Requirements:

• Interactivity

Does the tool provide interactivity that covers tables, charts, and quadrants?

• Functionality

Calculations, SQL generation, formatting, navigation techniques, layout controls

• Architecture

What generation of Web architecture does the tool require?

• Performance

– How quickly can users access the data they need?

– How long does it take to download dynamic client-side programs?

– What trade-off does the tool make between interactivity and performance?


Evaluating Web-Based Tools

Requirements

Wayne Eckerson, from the Patricia Seybold Group, outlined the following requirements for evaluating Web-based query and analysis tools.

Requirement / Specific Questions to Ask

Interactivity

Does the tool provide interactivity that covers tables, charts, and quadrants?

Note: Most tools provide static viewing capabilities.

Functionality

Compare the functionality of the Web-based tool to the functionality of its client-server-based version in the area of:

• Calculations

• SQL generation

• Formatting

• Navigation techniques

• Layout controls

Note: The Web-enabled tool must meet the requirements of your target audience.

Architecture

It is important to consider what generation of Web architecture the tool requires. Specifically consider:

• Does it support a four-tier architecture using CGI interfaces or native Web server interfaces?

• Does it support a three-tier architecture using Java client and server and proprietary client-server protocols?

• Does it use Java applets, ActiveX controls, plug-ins, or helper applications?

• How closely is the tool tracking emerging Internet and Web standards?

Performance

A tool that uses native Web server interfaces will run faster in a multiuser environment than tools that use CGI. Consider the following:

• How quickly can users access the data they need?

• How long does it take to download dynamic client-side programs?

• What trade-off does the tool make between interactivity and performance?


Evaluating Web-Based Tools

Requirements:

• Design

Does the tool require designers to do coding in HTML or CGI scripts to create sophisticated HTML reports?

• Administration

Does the tool control access to reports by user, group, and role?

• Output

Can the tool output data in a variety of formats and languages?

• Scalability

– What platforms does the tool’s main execution engine run on?

– Does it support load-balancing?

• Databases

What databases and native drivers does the tool support?

• Pricing

– How much does the tool cost?

– Does the tool support Web pricing?


Requirements (continued)

Requirement / Specific Questions to Ask

Design

Does the tool require designers to do coding in HTML or CGI scripts to create sophisticated HTML reports with drill-down, pivots, and embedded links?

Note: Design is an important factor to consider. Most tools use their existing client-server tools to build reports, which are then published in HTML. However, it is important to know what gets lost in the translation.

Administration

The tool must be able to control access to reports by user, group, and role. After users log on to the Web server, they should be presented with a custom menu that shows only those reports that they are authorized to access. Some of the questions to consider are:

• Does the tool have a utility for managing a great many report files on a Web server?

• How does it control user access to reports?

• Does it work with existing security features of application servers and database server?

Output

A good tool will generate HTML for wide-based distribution as well as reports in native proprietary format for use with helper applications. Advanced tools should also generate Java for display within a Java window. Specifically consider:

• Can the tool output data in a variety of formats such as grid, crosstab, and chart and in a variety of languages such as HTML, Java, and Excel?

• Which release of HTML does the tool support?

Scalability

• What platforms does the tool’s main execution engine run on?

• Does it support load-balancing?

Databases

• What databases does the tool support?

• Does it support both relational and OLAP databases?

• Does it use native drivers such as ODBC and JDBC?

• Does it support text?

Pricing

• How much does the tool cost?

• Does it support a Web pricing model?

Note: Many companies are starting to charge by concurrent user and the size of the server machine rather than by per-seat charges and flat-fee server pricing.

(Patricia Seybold Group, Wayne Eckerson, Web-Based Query Tools and Architecture. March 1997)
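The Administration requirement above (a custom menu listing only the reports a user is authorized to access, controlled by user, group, and role) can be sketched as a simple filter. The report names, users, groups, and roles are hypothetical, and a real tool would draw them from its own repository or from the security features of the application and database servers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    name: str
    allowed_users: frozenset = frozenset()
    allowed_groups: frozenset = frozenset()
    allowed_roles: frozenset = frozenset()


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset = frozenset()
    roles: frozenset = frozenset()


# Hypothetical report catalog published on the Web server.
REPORTS = [
    Report("Quarterly revenue", allowed_roles=frozenset({"analyst"})),
    Report("Payroll detail", allowed_groups=frozenset({"hr"})),
    Report("Executive summary", allowed_users=frozenset({"ann"})),
]


def menu_for(user: User, reports=REPORTS) -> list:
    """Return the names of the reports this user may see, by user, group, or role."""
    return sorted(
        report.name
        for report in reports
        if user.name in report.allowed_users
        or user.groups & report.allowed_groups
        or user.roles & report.allowed_roles
    )
```

A user named "ann" with the analyst role would see the executive summary and the quarterly revenue report, but not the payroll detail, which is restricted to the hr group.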


Summary

This lesson covered the following topics:

• Highlighting the main benefits of Web-enabling the data warehouse

• Discussing the main issues in deploying a data warehouse on the Web

• Specifying the requirements for Web-based tools


Practice 16-1 Overview

This practice covers the following topics:

• Completing the Web-based tool requirement checklist

• Justifying each response


Practice 16-1

Web-Based Tool Requirement Checklist

For each requirement in the following checklist, work through the specific questions, rate how important the requirement is to your own organization, and note why.

Checklist columns: Requirement; Specific Questions to Ask; Is This Important to You? Why? (The last column is left blank for your answers.)

Interactivity

• Does the tool provide interactivity that covers tables, charts, and quadrants?

Functionality

• Compare the functionality of the Web-based tool to the functionality of its client-server-based version in the areas of:

– Calculations

– SQL generation

– Formatting

– Navigation techniques

– Layout controls

Architecture

• Does it support a four-tier architecture using CGI interfaces or native Web server interfaces?

• Does it support a three-tier architecture using Java client and server and proprietary client/server protocols?

• Does it use Java applets, ActiveX controls, plug-ins, or helper applications?

• How closely is the tool tracking emerging Internet and Web standards?

Performance

• How quickly can users access the data they need?

• How long does it take to download dynamic client-side programs?

• What trade-off does the tool make between interactivity and performance?

Design

• Does the tool require designers to code in HTML or CGI scripts to create sophisticated HTML reports with drill-down, pivots, and embedded links?

Administration

• Does the tool have a utility for managing a great many report files on a Web server?

• How does it control user access to reports?

• Does it work with the existing security features of application servers and database servers?

Output

• Can the tool output data in a variety of formats, such as grid, crosstab, and chart, and in a variety of languages, such as HTML, Java, and Excel?

• Which release of HTML does the tool support?

Scalability

• What platforms does the tool’s main execution engine run on?

• Does it support load-balancing?

Databases

• What databases does the tool support?

• Does it support both relational and OLAP databases?

• Does it use native drivers or standard interfaces such as ODBC and JDBC?

• Does it support text?

Pricing

• How much does the tool cost?

• Does it support a Web pricing model?

Lesson 17: Managing the Data Warehouse


[Slide figure: course overview map. Banner: Project Management (Methodology, Maintaining Metadata). Blocks: Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Choosing a Computing Architecture; Modeling the Data Warehouse; Planning Warehouse Storage; ETT (Building the Warehouse); Meeting a Business Need; Supporting End User Access; Managing the Data Warehouse.]


Overview

This lesson explores the management issues, critical success factors, and challenges to successful data warehouse implementation. It addresses issues pertaining to the management of the entire warehouse life cycle.

Note that the “Managing the Data Warehouse” block is highlighted in the course overview slide.

Objectives

After completing this lesson, you should be able to do the following:

• Develop a plan for managing the transition from development to implementation

• Identify challenges pertaining to the growth of the data warehouse

• Describe backup and archive mechanisms

• Identify data warehouse performance issues


Managing the Transition to Production

Another set of key management issues surrounds the transition from warehouse development to production. These issues include:

• Promoting the support of management, developers, and end users for the changes accompanying the warehouse

• Choosing between a manageable pilot and large-scale implementation

• Documentation

• Testing

• Training

• Postimplementation support

• Maintaining the warehouse


Promoting Support for Change

Unfortunately, not everyone easily tolerates or accepts the introduction of new systems and associated technologies. End users and information systems personnel, in particular, are often bombarded with new systems and technology.

Staff may have reasons either to support or not to support the warehouse:

Management

– Reasons to support: competitive advantage; benefit from the investment

– Reasons not to support: fear of change; risk avoidance

Developers

– Reasons to support: opportunity to learn new and valuable skills; leading-edge technology

– Reasons not to support: fear that their existing skill set will become obsolete

End Users

– Reasons to support: faster and more flexible systems; improved and more powerful query tools

– Reasons not to support: disruption of routine; change of toolset; increased workload

Methods for Promoting Support

Given these fears, there are ways you can manage the transition to this new, exciting, but challenging environment. Some of these may be obvious; however, they are worth stating.

• Awareness: Ensure that everyone is aware of the benefit the warehouse is going to bring to the business. A profitable organization is able to grow, compete, adapt, and keep staff.

• Feedback and information: Ensure that all staff involved in the warehouse project are aware of what is happening at each stage. Provide constant and consistent feedback on status, including problems and successes.

• Skills: Ensure that the IT staff are trained in the skills they need (old and new).

• Education: Provide users with the training necessary to use the query tools effectively and imaginatively.

• Direction and control: Keep the project on course. Do not let any phase of development slip without understanding why, and apply what you learn to the next increment. Monitor progress constantly.


Choosing Between Pilot and Large-Scale Implementation

This choice should already have been made at an earlier planning stage. The preferable choice is a pilot, the success of which can be leveraged into further incremental rollouts.

The Warehouse Pilot

The pilot demonstrates benefits to management, end users, and IT staff.

Management: The warehouse can provide current and ongoing financial benefits to the business.

End Users: The types of information available, the flexibility of the tools, and the type of analysis possible.

IT Staff: Whether their strategy and development plans were appropriate. Changes can be made prior to developing the next increment.

Essential considerations for the pilot are to:

• Ensure that the subject matter chosen is relevant to the business. Thus, the pilot may focus on an initial business issue such as sales or marketing.

• Have a low technical risk by starting small and feasible. It may be that the pilot data comes from a single relational source and therefore is most likely to succeed as a proof of concept. Further iterations may extract data from diverse sources.

• Anticipate significant use.

• Ensure that the pilot, however small, remains within the context of the larger vision.


Piloting the Warehouse

You can position the pilot, or prototype, as the starting point of the iterative warehouse process mentioned earlier. It is a vital part of the implementation.

The pilot must cover all aspects of implementation and ensure user involvement at every step in the process or phase of the life cycle. A specific subject area of the warehouse is targeted for the pilot, and the query tools selected should be available to the users for data access.

The pilot fulfills a number of tasks, including those in the following list:

• It enables the designers to prove the model, the data, and the access tools.

• It enables the users to:

– See how easy the access tool is to use

– Enhance their data requirements

– Identify their training requirements

– Measure query performance

• It allows the developers to:

– Determine whether the ETT process is adequate and modify it accordingly

– Identify any issues with the metadata presented to the users or used by ETT

– Determine the users’ near future and possibly even long-term requirements

– Identify and define the users’ training needs

– Test access levels and security of the systems and data

– Monitor performance

Several things must be in place before and during piloting:

• Ensure that acceptance criteria are documented and agreed upon.

• Identify volume and scalability tests and develop a test plan with test cycles.

• Once the tests are executed, gather statistics on performance and optimize where necessary.

You must also test the entire process of refreshing the data, and produce a report that contains a complete and detailed evaluation of this proof of concept.


Documentation

This process centers on producing all user and technical documentation for the data warehouse, including a glossary, references, user and system operations guides, and online help.

Metadata Reference Guide: To ensure active and successful use of the warehouse, the metadata reference guide describes the contents of the data warehouse in business terms and provides a navigational road map to the contents of the warehouse.

Warehouse Management Reference: The warehouse management documentation outlines the workflow and procedures (both manual and automated).

New Features Guide: The new features guide highlights any enhancements to warehouse functionality that result from the implementation of the solution.

Testing the Warehouse

Do not assume, “No problem, it will work.” Always test components.

Test Database: Testing is required at every stage of development, involving every component, ideally on a test database, using a machine and network setup as close as possible to the planned production environment.

If you are using Oracle Data Warehouse Method, testing is a specific requirement during most phases and for many tasks.


Training

During project planning, allocate time and resources to educating key information technology staff, end users, and management personnel about data warehousing and its benefits. Education begins at the start of development and continues through to the end and on to further iterations.

Educating Users: Educating users on how to access data is one of the most critical areas of warehouse training. Always ensure that representatives from each user group are invited to courses and workshop sessions.

Users need to know how:

• The metadata represents the business data

• To use the decision support tools to answer business questions

• To create ad hoc queries and save data results

• To contact the help desk or support group for assistance

• To register requests for enhancements through a formal change management process

Educating IT Staff: Information systems staff need education in the following areas:

• How to communicate and understand people issues

• Business analysis techniques

• Technical aspects of the hardware architecture

• The network environment

• Decision support and OLAP tools—implementing, building, and supporting

Educating everyone involved with the warehouse is most critical for the first implementation. Everyone must be made aware of what the warehouse is, even if they are not directly involved with the project.


Postimplementation Support

This process provides an opportunity to evaluate and review the implementation. You access metadata and evaluate queries and reports run against the warehouse. This information helps you manage standard queries and reports and the user layer, and identify required indexes.

Monitoring the Warehouse: After implementation, you need to monitor the warehouse continuously. Ongoing tasks include:

• Monitoring and responding to system problems

• Conducting performance tuning activities for all components of the data warehouse

• Rolling out metadata, queries, reports, filters, and conditions

• Implementing security

• Incorporating new users

• Distributing data marts and catalogs

• Transferring ownership (responsibility for the data warehouse may be transferred from IT personnel to the owning organization)


[Figure: Expanding user numbers. Chart of the number of users (axis 0 to 300) against the period after implementation (Initial, 3 Months, 6 Months, 12 Months, 24 Months). Source: Data Warehouse Institute Flash Report, January 1996]


Managing Growth

The table below shows the results of a survey: the number of users actively querying a successful warehouse grows substantially during the first two years. The steepest rise occurs between 12 and 24 months, when the figure for large and small sites climbs from 99 to 255 users.

Once the benefits of the warehouse become tangible to the user community, demands on the warehouse increase dramatically.

The table and chart are sourced from the Data Warehouse Institute Flash Report, January 1996.

Types of Growth

• Increasing number of users

• Broader, more varied usage

• Growth of data volumes

The database increases in size through the accumulation of historical data and the addition of new subject areas.

Warehouse usage increases through the availability of new decision support functionality and the evolving empowerment of the user population.

Number of Users Actively Querying the Warehouse

Period            Large and small sites   Small sites
Initial number    16                      6
After 3 months    19                      12
After 6 months    44                      20
After 12 months   99                      28
After 24 months   255                     55
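The survey figures in the table above can be turned into growth factors, which are often easier to plan with than raw user counts. The following is a minimal sketch in Python; the function and variable names are illustrative only, and the numbers are copied from the table rather than from any additional data.

```python
# Survey figures quoted in the table above
# (Data Warehouse Institute Flash Report, January 1996).
# Keys are months after implementation; values are users actively querying.
USERS_ALL_SITES = {0: 16, 3: 19, 6: 44, 12: 99, 24: 255}
USERS_SMALL_SITES = {0: 6, 3: 12, 6: 20, 12: 28, 24: 55}


def growth_factors(users_by_month):
    """Return (start_month, end_month, factor) for each consecutive pair of survey points."""
    months = sorted(users_by_month)
    return [
        (start, end, users_by_month[end] / users_by_month[start])
        for start, end in zip(months, months[1:])
    ]


def overall_factor(users_by_month):
    """Return the growth factor between the first and last survey points."""
    months = sorted(users_by_month)
    return users_by_month[months[-1]] / users_by_month[months[0]]


if __name__ == "__main__":
    for label, series in (("Large and small sites", USERS_ALL_SITES),
                          ("Small sites", USERS_SMALL_SITES)):
        print(label)
        for start, end, factor in growth_factors(series):
            print(f"  months {start:>2} to {end:>2}: x{factor:.2f}")
        print(f"  overall: x{overall_factor(series):.2f}")
```

A planner might compare these factors with the user growth observed after a pilot, bearing in mind that the survey dates from 1996 and may not reflect a particular site.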


Expansion and Adjustment

Continually evaluate the warehouse to identify:

• Changes that can be made

• Additional increments (although these are usually identified in the primary strategy phase of development)

• Components that may be removed (for example, unused summaries)

• Optimal indexing and performance strategies

Openness for the Future: An open architecture and toolset are required to suit current and future requirements.

Document for the Future: You should document the process used in developing the data warehouse solution and collect metrics, as an aid to:

• Future planning

• Further and future cost analysis of current or new projects

• Identification of errors and inadequacies that can be eliminated for the next project

• Assessing tool performance

Note: The DWM Transition to Production Phase creates tasks for these postimplementation issues. The Discovery Phase evaluates all warehouse components.

Controlling Expansion

To control the expansion and adjustment process, and to promote its success, you should:

• Ensure the continuity of staff on warehouse projects

• Document the process used in developing the warehouse solution, and collect metrics

• Establish a working test and production architecture that can be used for further increments

• Determine a strategy for managing changes to the data. As organizational structures change, the historical data reflects a different story.


Sizing Storage

Sizing data storage (or capacity planning) takes place at a number of stages, for each increment of the data warehouse solution. The estimate is often revised before being finalized. Sizing must take into account all the object space needed, not just the space for the warehouse data itself.

Do Not Underestimate: Capacity planning is an art in itself. There are many objects for which space must be accurately estimated, such as tables, indexes, logs, sort areas, and temporary space. You may think that this is not much different from the operational system; however, with the warehouse you are looking at very large databases with very large space requirements.

It is all too common, when sizing, to forget these additional objects.

Planning for Growth: In addition, your early planning stages must consider the growth of these areas. Once implemented, the data warehouse can grow very quickly, with data added at every refresh cycle, and space must be available for that growth.

Removing Unwanted Data: When data is no longer needed, it is either purged (removed and never used again) or archived (for possible later use). Consider the space and location of archive data.

Pay careful attention to determining the storage requirements for the warehouse. This includes space for:

• Data—fact, dimension, reference, and summary

• The staging file store

• Indexes

• Backup and recovery strategies

• Temporary files
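The point about planning for growth at every refresh cycle can be illustrated with a simple projection. This is only a sketch under stated assumptions: it assumes each refresh adds the same volume and that nothing is purged or archived, and the function name, parameters, and headroom factor are hypothetical rather than part of any Oracle method.

```python
def project_storage_gb(initial_gb, added_per_refresh_gb, refresh_cycles, headroom=0.0):
    """Project the space needed after each refresh cycle.

    Assumes every refresh adds the same volume and nothing is purged or archived.
    headroom is a safety fraction (0.25 means 25 percent extra) applied to each figure.
    Returns a list with one entry per cycle, starting with cycle 0 (the initial load).
    """
    if min(initial_gb, added_per_refresh_gb, refresh_cycles, headroom) < 0:
        raise ValueError("all inputs must be non-negative")
    return [
        (initial_gb + cycle * added_per_refresh_gb) * (1 + headroom)
        for cycle in range(refresh_cycles + 1)
    ]


if __name__ == "__main__":
    # Hypothetical figures: 100 GB initial load, 5 GB per monthly refresh, one year.
    for cycle, size in enumerate(project_storage_gb(100, 5, 12, headroom=0.25)):
        print(f"after refresh {cycle:>2}: {size:.1f} GB")
```

In practice the volume added per refresh is rarely constant, so a projection like this is best rerun whenever test loads or production statistics give a better figure.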


Estimating Storage

In any discussion on this subject, you find a vast array of different ideas, opinions, methods, and approaches. There is no one single recommendation. You need to consider the different approaches that are possible, choose the best for you and your data warehouse, and keep it simple. You should never underestimate the amount of space needed in the data warehouse.

In order to estimate accurately, you need to answer some simple questions about your data:

• What is the expected volume of core fact data?

• What is the lifetime of core fact data?

• Do you have the technologies to support that volume?

• If not, do you need to purchase the technologies?

• How important is storing pre-summarized data?

• Does your recovery strategy involve mirroring or other techniques requiring disk storage?

Objects That Need Space

A detailed understanding of available data is essential for planning capacity at an early stage. Capacity planning is ongoing throughout the life of the warehouse. Consider disk requirements:

• Intermediate data store (This is sometimes implemented as an Operational Data Store (ODS) and referred to as a staging area. It holds data that has been extracted from source systems, prior to being loaded into the warehouse.)

• Indexes, of which there may be many more than in normal operational systems

• Metadata that contains the map to the warehouse structure and content

• Summary data that comprises aggregated data

• Redo logs and rollback information

• Sort areas and temporary space

• Load files moved to the server

• Workspace for backup and recovery
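One way to avoid forgetting any of the objects listed above is to keep them as named fields in a single estimate. The sketch below is illustrative only: the class name, field names, and sample figures are hypothetical, and the `warehouse_data` field (the base fact and dimension data) is an assumption added so that the overhead can be compared with the data itself.

```python
from dataclasses import dataclass, fields


@dataclass
class SpaceEstimateGB:
    """Disk estimate in gigabytes, one field per object that needs space."""
    warehouse_data: float      # assumed: base fact and dimension data
    staging_area: float        # intermediate data store (ODS or staging area)
    indexes: float
    metadata: float
    summary_data: float
    redo_and_rollback: float
    sort_and_temp: float
    load_files: float
    backup_workspace: float

    def total(self):
        """Return the sum of every component."""
        return sum(getattr(self, f.name) for f in fields(self))

    def overhead_share(self):
        """Return the fraction of the total that is not base warehouse data."""
        total = self.total()
        if total == 0:
            return 0.0
        return (total - self.warehouse_data) / total


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    estimate = SpaceEstimateGB(
        warehouse_data=200, staging_area=50, indexes=150, metadata=1,
        summary_data=40, redo_and_rollback=20, sort_and_temp=30,
        load_files=25, backup_workspace=84,
    )
    print(f"total: {estimate.total()} GB")
    print(f"share outside base data: {estimate.overhead_share():.0%}")
```

Because every component is a required field, leaving one out raises an error when the estimate is built, which mirrors the warning that these additional objects are easy to overlook.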

.....................................................................................................................................................17-26 Data Warehousing Fundamentals

.....................................................................................................................................................Lesson 17: Managing the Data Warehouse

Copyright Oracle Corporation, 1999. All rights reserved.

Test Load Sampling

• Analyze statistically significant data samples

• Use test loads for different periods

• Reflect day-to-day operations

• Include seasonal data and worst-case scenarios

– Calculate number of transactions

– Employ average sales price approach


Test Load Sampling

You have to decide on the capacity planning technique that suits you best. You may already have a method that is successful for your operational environment and can be enhanced for VLDBs and other warehouse objects, such as ODSs.

A good approach to sizing is based on the analysis of a statistically significant sample of the data.

Test loads can be performed on data from a day, a week, a month, or any other period of time. Care must be taken that the sample periods reflect the true day-to-day operations of your company, and the results include any seasonality issues or other factors, such as worst-case scenarios that may otherwise prejudice the results. Once you have determined the number of transactions based on the sample, then you calculate the size by using the average sales price approach.
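As a rough illustration of extrapolating from a sample, the following Python sketch scales sampled daily transaction counts to a yearly figure. The function name, the sample values, and the choice of the busiest sampled day as the worst case are assumptions made for this example, not part of the course material.

```python
from statistics import mean

def estimate_annual_transactions(daily_samples, operating_days=365):
    """Extrapolate yearly transaction volume from sampled daily counts.

    Returns (typical, worst_case): the mean-based estimate, and an
    estimate that assumes every day looks like the busiest sampled day.
    """
    if not daily_samples:
        raise ValueError("at least one sample day is required")
    typical = round(mean(daily_samples) * operating_days)
    worst_case = max(daily_samples) * operating_days
    return typical, worst_case

# Hypothetical sample: three ordinary days plus one seasonal peak day.
samples = [10_000, 12_000, 11_000, 27_000]
typical, worst_case = estimate_annual_transactions(samples)
print(typical, worst_case)  # 5475000 9855000
```

The gap between the two figures shows why a sample that omits seasonal peaks can understate the space required.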


Average Sales Price

Assume transaction-level grain.

Total company revenue: $20 billion
Avg sale price per line item: $5
Number of line items per year: $20 billion / $5 = 4 billion
Number of base fact records: 4 billion x 3 yrs = 12 billion
Key fields: 4 (x 4 bytes)
Fact fields: 4 (x 4 bytes)
Base fact table size: 12 billion x 8 fields x 4 bytes = 384 GB


Average Sales Price

Use other methods:

• It is difficult to obtain an accurate average

• The resulting calculation can be inaccurate

Do not use this approach on its own


Average Sales Price

The following calculation shows how to estimate the amount of direct-access storage device (DASD) needed for three years’ worth of data, using an average sales price algorithm.

If you take your company’s annual gross revenues and divide by the average revenue per transaction, then multiply this figure by the length of the row (key columns and data columns) in your fact table, you have the amount of DASD needed for a year's worth of data.

You should never use this approach on its own; it is simplistic. The problem is that it is difficult to get the average revenue per transaction. It is unusual to have a set price point or even a relatively narrow price range for the products offered by any company. Many companies have products that sell in volume at relatively low prices, say $5, and they may have low-volume big-ticket items as well, all of which distort the average.

For example, if the average used is $5, you need 384 GB of DASD (12 billion rows of 32 bytes each), but if the average is in reality $10, you need only 192 GB of DASD.

Note: This approach is one that is recommended by Ralph Kimball, and takes a business view rather than a technical view to sizing.

Total company revenue: $20 billion
Average sales price per line item on an individual customer receipt: $5
Number of line items per year for total business: $20 billion / $5 = 4 billion
Number of base fact records: 4 billion × 3 (years) = 12 billion
Number of key fields: 4 (assume 4 bytes per field)
Number of fact fields: 4 (assume 4 bytes per field)
Total fields: 8
Base fact table size: 12 billion × 8 fields × 4 bytes = 384 GB
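The arithmetic in the table can be sketched in Python as follows. The function and parameter names are illustrative only, and the sketch assumes decimal gigabytes (1 GB = 10^9 bytes) and ignores block overhead, indexes, and summaries.

```python
def base_fact_table_bytes(annual_revenue, avg_price_per_line_item,
                          years, key_fields, fact_fields, bytes_per_field):
    """Estimate base fact table size at transaction-level grain."""
    line_items_per_year = annual_revenue // avg_price_per_line_item
    fact_records = line_items_per_year * years
    row_bytes = (key_fields + fact_fields) * bytes_per_field
    return fact_records * row_bytes

GB = 10**9  # decimal gigabytes, an assumption matching the round figures used

at_5 = base_fact_table_bytes(20 * 10**9, 5, 3, 4, 4, 4)
at_10 = base_fact_table_bytes(20 * 10**9, 10, 3, 4, 4, 4)
print(at_5 / GB, at_10 / GB)  # 384.0 192.0
```

Doubling the assumed average price halves the estimate, which is why the approach should not be used on its own.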


Other Techniques and Considerations

• Queuing models

• Rule of thumb

Total database size is three to four times the size of the base fact tables

• Consider:

– Sparseness

– Dimensions

– Indexes

– Summaries

– Sort operational space


Space Management

• Monitor

• Avoid fragmentation

• Test load data

• Plan for growth

• Know business patterns

• Never let space become an issue


Other Techniques and Considerations

Queuing Models Mathematical models can predict response time and throughput.

Rule of Thumb This rule is often quoted within Oracle. Depending upon the database server and end-user tools, the total database size is three to four times the size of your base fact tables.
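A minimal Python sketch of the rule of thumb follows. The function name and the 100 GB input are hypothetical; the factors of three and four come from the rule as stated above.

```python
def total_database_size_range(base_fact_gb, low_factor=3, high_factor=4):
    """Apply the three-to-four-times rule of thumb to a base fact table size."""
    return base_fact_gb * low_factor, base_fact_gb * high_factor

# Hypothetical base fact table of 100 GB.
print(total_database_size_range(100))  # (300, 400)
```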

Other Considerations

• Sparseness of data in fact tables

You must also consider the sparseness of data. Fact table data is generally sparse; relatively few of all the possible key value combinations are present. Summary tables are not considered sparse. That is, they contain values for every possible key value combination.

• Large dimension tables

• Significant increase in size of database caused by indexes

• Large summary tables. Sometimes they occupy as much space as the base fact table. There may be hundreds of summary tables for a warehouse implementation.

• The need for sort operational space for sorting and loading

Note: You may consider using leasing and chargeback strategies for any excess storage capacity, especially in a massively parallel processor (MPP) configuration.

Space Management

Once you have determined a technique for planning capacity and are aware of the numerous objects that need space, you need to consider how to manage this space:

• The space usage must be monitored and any fragmentation noted and resolved.

• You should load test sets of data and consider careful analysis (use the ANALYZE command) to estimate average row length and rows per block, to predict whether you have sufficient capacity.

• You need to consider how the database is going to grow, and plan for additional storage accordingly. Fact data grows rapidly, depending upon the refresh cycle frequency; it grows every time a refresh occurs.

• Knowing the patterns within your business is key to planning these requirements.

Never allow space to become an issue in a warehousing environment; you can see, with all the operations discussed, how important it is.
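The capacity check described above can be sketched in Python once a test load has produced an average row length. This is only an illustration: the 8 KB block size, the 100-byte per-block overhead, and the 10 percent free-space reserve are assumed values, and the function names are hypothetical. Real figures depend on your database configuration.

```python
import math

def blocks_needed(expected_rows, avg_row_len, block_size=8192,
                  block_overhead=100, pct_free=10):
    """Estimate how many database blocks a table of expected_rows needs."""
    usable = (block_size - block_overhead) * (100 - pct_free) / 100
    rows_per_block = math.floor(usable / avg_row_len)
    if rows_per_block < 1:
        raise ValueError("average row does not fit in one block")
    return math.ceil(expected_rows / rows_per_block)

def refreshes_until_full(free_blocks, blocks_per_refresh):
    """Number of whole refresh cycles that fit in the remaining free blocks."""
    return free_blocks // blocks_per_refresh

per_refresh = blocks_needed(1_000_000, 32)   # one million 32-byte rows
print(per_refresh)                                # 4406
print(refreshes_until_full(10_000, per_refresh))  # 2
```

Running the second calculation after each refresh gives an early warning before space becomes an issue.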


Managing Backup and Recovery

• Business requirements for availability

• Fast recovery essential

• Strategy:

– Defined

– Tested

– Proven

– Evolving


Backup Strategy

• Is based on the business requirements and the cost benefit

• Involves large volumes of data:

– All objects except temporary tablespaces

– Incremental

• Includes first-time load and refreshes


Managing Backup and Recovery

Availability

Availability is the key requirement of mission-critical data warehouses; recovering after any type of failure must happen fast.

Some companies demand round-the-clock availability, making partial recovery mechanisms imperative.

Backup Strategy To recover the database quickly implies that there is a well-defined, tested, and proven backup strategy, as well as a disaster recovery strategy in case of fire, flood, or infestation.

Evolving Strategies Ensure that as the warehouse evolves, the backup and recovery strategies also evolve synchronously. Test the backup and recovery procedures constantly to ensure that they are relevant to your current environment.

The strategy you deploy is based on the business requirements and the cost benefit. The strategy is not just when and what to back up, but what tools and utilities you are going to use.

Backing up data is different in the data warehouse environment. You are dealing with much larger volumes of data than operational systems and higher availability requirements.

What to Back Up Everything in the data warehouse must be backed up, except temporary tablespaces; that is, the data and tables, metadata, indexes, constraints, stored procedures, and triggers.

When to Back Up A critical part of your overall strategy is to determine when the database needs to be backed up. This is no different from an operational environment, except that the frequency of changes to data is unlikely to be as great as that in the operational environment.

You should back up after the first-time load, after incremental refreshes, and after any changes to the database structure, such as adding fact or summary tables. Incremental backups are used, because the data is static between loads.

You need to outline the strategy to include full and incremental backups as you would in an operational environment.


Defining the Strategy

• Mission-critical systems

• SLAs:

– Defined downtime

– Acceptable MTBF

• Efficient backup and recovery

• Evaluation of different technologies


Planning for Backup

• Plan at the design stage

• Use hot backups for VLDBs

• Back up necessary components:

– Fact and dimension data

– Warehouse schema

– Metadata schema

– Metadata

• Export/Import utility:

– Disk space

– Time


Defining the Strategy

All backup and recovery strategies and tasks must reflect the fact that the data warehouse contains valuable mission-critical data.

A service level agreement (SLA) is drawn up between you and the customer (the users) in the early stages. The SLA should at least define what downtime means (each user may have a different perspective on this) and the acceptable mean-time-between-failure (MTBF) figures.

The backup hardware environment must be as efficient as possible, considering the implications and technicalities of deploying RAID, striping, mirroring (some parts of the database need to be mirrored, others can employ RAID), or partitioning (backing up partitions of data rather than an entire database).

Planning for Backup

The backup and recovery strategy for a warehouse needs to be considered at the design stage. Details such as how the data is partitioned greatly affect the strategy. For small and medium databases, daily cold backups (taken while all instances of the database are shut down) and export/import are viable backup tools.

However, once you move to VLDBs, complete cold backups become difficult to fit into an overnight window. In addition, the disk space required for a complete export of a large database becomes an issue. You need to consider other strategies such as using tape or other devices.
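Whether a complete backup fits the overnight window is a simple calculation, sketched below in Python. The database sizes, the throughput of 20 GB per hour, and the eight-hour window are hypothetical figures chosen for illustration; measure your own backup throughput before relying on such an estimate.

```python
def backup_hours(database_gb, throughput_gb_per_hour):
    """Elapsed hours for a complete backup at a sustained throughput."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return database_gb / throughput_gb_per_hour

def fits_window(database_gb, throughput_gb_per_hour, window_hours):
    """True if a complete backup finishes within the available window."""
    return backup_hours(database_gb, throughput_gb_per_hour) <= window_hours

print(fits_window(50, 20, 8))     # True: 2.5 hours
print(fits_window(1500, 20, 8))   # False: 75 hours
```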

The defined backup strategy for the warehouse should allow for hot backups, where you can back up any part of the database at any time of the day, while the database instances are still active. With Oracle, this means backing up individual and active tablespaces.

You should back up every component that is essential to warehouse operations; everything required to restore a working environment: fact data, dimension data, data warehouse and metadata schema, and data warehouse metadata.

Export/Import

The export/import utility enables all or part of a database to be extracted into a dump file and then imported into another database (under another owner if required). Generally, export/import of a VLDB uses too much disk space. You could use named pipes to a disk on a UNIX system to overcome space problems. However, this technique would be very time-consuming.


Backup Tools

• Oracle7 Enterprise Backup Utility

• Oracle8 Recovery Manager

• Utilities:

– Import and Export

– Operating system

– Third party


Parallel Backup and Recovery

• Parallel Backup

Runs simultaneously from any node

– Off-line

– Online

• Parallel Recovery

Runs simultaneously from redo logs


Backup Tools

Oracle7 Enterprise Backup Utility (OEBU) This provides a user-friendly interface, documentation, and the recording of backup details in a recovery catalog.

Oracle8 Recovery Manager (RMAN) Oracle8 Recovery Manager creates image backups and incremental backups. RMAN stores the information from multiple data files (or archive logs, but not both) in a backup set, stored in a format that cannot be processed directly (similar to the Export.dmp file principle). RMAN performs either cold or hot backups.

OEBU and RMAN are very useful in the VLDB environment to ensure that tasks occur without error.

Utilities

• Oracle Import and Export

• Operating system utilities, such as UNIX cpio or tar commands, VMS EXCHANGE, and Windows NT ocopy73.exe or ocopy80.exe

• Third-party utilities that provide a user-friendly layer over operating system backups

Parallel Backup and Recovery

Parallel Backup With parallel operations, backups can be performed simultaneously from any node of a parallel server.

• Online backups enable the database to be backed up while active, allowing users continuous access.

• Offline backups enable the database to be backed up while shut down, preventing user access.

Parallel Recovery The goal of parallel recovery is to employ I/O parallelism to reduce the elapsed time required to perform crash recovery, instance recovery, or media failure recovery. The server uses one process to read files sequentially and dispatch redo information to several recovery processes to apply the changes from the log files to the data files.


System Failures

• Process

• Database instance

• Media

• Natural disaster

Failures are costly


System Failures

System failure can be very costly in a warehouse environment. The causes fall into four categories:

Process Failure Strict and rigorous testing of your plans should prevent this situation occurring on a regular basis; however, you cannot afford to ignore the fact that it may happen. Identify an approach to monitoring processes and detecting errors and a mechanism for reapplying the failed processes.

Database Instance Failure Instance failure occurs when the Oracle SGA and background processes cannot work. Failure is typically caused by:

• Hardware problems such as power failure

• Software problems such as an operating system crash (hanging)

In an instance failure, data in buffers not yet written to disk will be lost.

Media Failure Media (disk) failure occurs when errors are detected writing or reading data from disk. It is often caused by disk head crash and affects different types of file such as data files, redo logs, and control files.

Media failures mean that data in buffers not yet written to disk is lost.

Natural Disasters Natural occurrences such as flood and fire may result in the system becoming unusable.


Disaster Recovery Requirements

• Replacement or standby machine

• Tape and disk capacity

• Communication links to users and data

• Copies of software

• Database backup

• Administration and operations staff

• Documentation


Disaster Recovery Planning

• Establish the strategy

• Prepare the strategy

• Maintain the strategy

• Audit the strategy

• Test recovery plan regularly

• Gain approval from users


Disaster Recovery

Protecting your investment is of the highest consideration. A disaster occurs when a major site loss takes place; usually the site has been destroyed or damaged beyond immediate repair.

Requirements

Recovering from disaster requires the following facilities:

• A replacement, or standby, machine

It does not have to be as large as the main machine but must have sufficient capacity to run a minimal system and the power to allow the recovery to take place on a meaningful timescale.

• Sufficient tape and disk capacity to perform the recovery on a reasonable timescale

Having sufficient disk space to run the minimum independent system is not always enough. You may need extra disk capacity to allow initial recovery to happen on a reasonable timescale.

• Communication links to and from users and data owners

• Communication links to and from data sources

If the system is to be accessible to users, the communication links they need to access the machine must be in place. The links must have sufficient bandwidth and capacity. This is particularly important if the links are already in use by other systems. There is no point in putting a disaster system in place if the users cannot use it.

• Copies of all relevant pieces of software and licensing agreements

• Backup of database

• Application-knowledgeable systems administration and operations staff, along with current documentation in written or electronic format

Planning

You should thoroughly test the disaster recovery plan on a regular basis, say every six months. New versions of systems, software, and data are constantly being added and the frequency of the test must take into account these ongoing changes. The strategy is normally audited: you need someone to establish, prepare, and maintain the strategy. The plans must be approved by the business and information systems users.


Archiving Data

• Determine data life expectancy

• Identify archive frequency

• Use read-only tablespaces

• Plan and design into early specifications


Purging Data

• Reduce data volumes:

– Create summaries.

– Remove unwanted base data.

• Choose the most effective method.


Archiving Data

The warehouse design needs to estimate and accommodate the data life expectancy.

Establish how long you want to hold data before removing it completely from the live database. You may be required to archive old data to tape, or to another database. In small and medium warehouse databases, the amount of data involved is generally small; in larger databases the data volumes involved may be significant.

Read-only Tablespaces Data warehouse databases, due to their size, require a backup method that is as fast as possible and reduces the amount of data to be backed up. You should use partitioned read-only tablespaces that enable you to archive the tablespace while it is in read-only mode.

• You do not have to back up a read-only tablespace after making the first backup.

• Read-only tablespaces reduce the cost of archive storage. They can be stored on less expensive media such as a CD-ROM. Ensure that the device to which you are writing can be accessed quickly.

• As part of your archive strategy, you can use read-only tablespaces to hold infrequently accessed data.

Data archiving can impose an ongoing heavy load on the system; if you do not plan for this in the design and implementation, it can have a detrimental effect on performance.

Purging Data You may be able to reduce the amount of data held by summarizing and aggregating older data. For example, you may be able to summarize data into monthly and weekly summaries at the end of each month, and then remove the detail fact data. This data should be stored offsite in case it is needed to re-create the summary files.

When you remove data, always choose the most cost-effective method in terms of CPU and database resources. For example, in the case of Oracle, drop the table, or drop the partition if the table is partitioned (ALTER TABLE ... DROP PARTITION), rather than using the DELETE command to remove the unwanted rows. Unlike a row-by-row DELETE, a DROP does not generate rollback and redo information for each row removed.
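The summarize-then-remove sequence can be sketched with Python's built-in sqlite3 module. This is an illustration only: SQLite stands in for the warehouse database, a per-month detail table stands in for one partition of the fact table, and all table and column names are hypothetical. Oracle syntax and behavior differ, and the detail data should be archived offsite before it is dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One detail table per month stands in for one partition of the fact table.
cur.execute(
    "CREATE TABLE sales_1999_01 (sale_day TEXT, product_id INTEGER, amount REAL)"
)
cur.executemany(
    "INSERT INTO sales_1999_01 VALUES (?, ?, ?)",
    [("1999-01-04", 1, 5.0), ("1999-01-04", 2, 7.5), ("1999-01-18", 1, 5.0)],
)
cur.execute(
    "CREATE TABLE sales_monthly (month TEXT, product_id INTEGER, total_amount REAL)"
)

# Step 1: summarize the month's detail rows into the summary table.
cur.execute(
    "INSERT INTO sales_monthly "
    "SELECT '1999-01', product_id, SUM(amount) "
    "FROM sales_1999_01 GROUP BY product_id"
)

# Step 2: remove the detail in one DDL operation instead of DELETE row by row.
cur.execute("DROP TABLE sales_1999_01")
conn.commit()

summary = cur.execute(
    "SELECT product_id, total_amount FROM sales_monthly ORDER BY product_id"
).fetchall()
print(summary)  # [(1, 10.0), (2, 7.5)]
```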


Improving Query Efficiency

• Improve database design

• Use indexes

• Use governors

• Use prepared and tested queries

• Run large jobs out of hours

• Oracle 8i Resource Manager can guarantee resource availability to specific groups

• Use data marts


Network Performance

• Provide sufficient bandwidth

• Provide optimal configuration for access

• Identify middleware requirements

• Know refresh volumes

• Consider interaction with job scheduling software

• Use client-side processing

• Deploy data marts

• Analyze traffic


Identifying Data Warehouse Performance Issues

Improving Query Efficiency

The basic design has many implications for performance. A poor design is never going to provide efficient access; if access is inefficient, consider redesigning the database.

• Ensure that indexes exist on key values to minimize full-table scans.

• Always write SELECT statements that retrieve only the minimum amount of data required.

• Administer resource governors—query blocking—on the server and with the tools where they have governing capabilities.

• Make available the use of prepared and pretested queries.

• Submit large jobs out of working hours, or when CPU usage and network and I/O contention are at a minimum.

• Oracle 8i Resource Manager can guarantee resource availability to specific groups.

In addition to the above considerations, you may also consider using a data mart strategy to offload query actions to a smaller subset of the warehouse data.

Network Performance

The data warehouse environment is commonly distributed (a data warehouse feeding data marts), using networks to provide data transfer mechanisms. The network must be planned and set up to meet data movement and access requirements. Users should not have restricted access to data. You need to:

• Ensure that the network has an appropriate bandwidth particularly for load processing.

• Ensure the configuration of the environment is optimal for user access to data.

• Identify whether any middleware is needed to convert data or read non-Oracle data.

• Identify update frequencies and ensure the network is capable of handling the volumes.

• Consider how the job scheduling software interacts with the network setup.

• Use tools that perform intensive processing activities (such as summarizing and sorting) on the client side; alternatively, the server itself may perform these activities.

• Deploy data marts at remote locations.

Analyzing Network Traffic You should consider using tools to analyze current activity and aid in the preliminary planning of the requirements.


Review and Revise

Monitor the warehouse:

• Usage

• Access

• Accurate grain

• Detail data

• Periodicity


Review and Revise

Once the data warehouse is in use, you should monitor it and determine the data that is being accessed, and the frequency of that access.

You should also use this information to determine whether the grain of the data is right for the user requirements. Often data may have to be stored at different levels of granularity to answer sophisticated user queries. This is referred to as multiple granularity. If a user often requests simple annual sales figures for a given product, this may be satisfied with a summary table. If the user requests sales figures for a product by month, then you can provide the same information from 12 time-series tables. Of course, this involves extra processing. You need to determine early on the levels of granularity, and how long they are to remain in place in the warehouse.
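The relationship between two levels of granularity can be sketched in Python by rolling monthly figures up to annual totals. The data structure, the function name, and the sample values are hypothetical; in a warehouse this roll-up would normally be a summary table maintained by the load process.

```python
from collections import defaultdict

def roll_up_to_annual(monthly_sales):
    """Aggregate {(product, year, month): amount} into {(product, year): amount}."""
    annual = defaultdict(float)
    for (product, year, _month), amount in monthly_sales.items():
        annual[(product, year)] += amount
    return dict(annual)

# Hypothetical monthly figures: a full year of 1998 and one month of 1999.
monthly = {("widget", 1998, m): 100.0 for m in range(1, 13)}
monthly[("widget", 1999, 1)] = 40.0
print(roll_up_to_annual(monthly))
# {('widget', 1998): 1200.0, ('widget', 1999): 40.0}
```

Storing the annual result avoids repeating the aggregation for every query, at the cost of the extra space and refresh processing.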

You should balance the issues against your requirements and resources:

• How often is detail data access required? This determines the real need for details and their duration.

• What are the benefits of keeping detail for a specified period?

• Do the benefits outweigh the cost in machine resources?

These questions, and others, can be answered in part with stringent query monitoring to give you usage information. Use this to calculate benefits against costs.
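As one way of turning query monitoring into usage information, the following Python sketch computes each table's share of logged accesses. The log format of (user, table) pairs and all names are assumptions for illustration; real monitoring tools record far more detail.

```python
from collections import Counter

def access_share(query_log):
    """Return each table's share of all logged accesses, as a fraction."""
    counts = Counter(table for _user, table in query_log)
    total = sum(counts.values())
    return {table: n / total for table, n in counts.items()}

# Hypothetical log entries: (user, table accessed).
log = [("ann", "sales_monthly"), ("bob", "sales_monthly"),
       ("ann", "sales_detail"), ("cho", "sales_monthly")]
print(access_share(log))  # {'sales_monthly': 0.75, 'sales_detail': 0.25}
```

A low share for detail tables over a long period is one input to the cost-benefit question of how long detail should be kept.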


Secret of Success

Think big, start small


Secret of Success

Your eventual goal may be the enterprisewide solution, but take small steps to achieve it. The enterprisewide warehouse is not a realistic objective for your first pass. Always use the proven low-risk incremental approach.


Summary

The successful warehouse:

• Is driven by the business

• Focuses on objectives

• Adds value to the business

• Can be understood and used

• Delivers good data

• Performs well

• Belongs to the users


Summary

Success is achieved if the data warehouse:

• Is driven by a business community with clearly identified requirements. Remember that this is the primary objective of the data warehouse, and the users must be responsible for driving the end result.

• Focuses on the objectives outlined in the early stages of development.

• Adds value to the decision-making process, and can be seen to provide value with better and proven results. It is important that you define how the success of the warehouse will be measured. Without any measures, you cannot determine whether the warehouse has added value.

• Can be understood by the business community. The data in the warehouse must be understood to ensure that the users are capable of using it to full effect. The data must also mean the same to all users. For example, an algorithm that provides a statistic must be documented in a way that every user can understand.

• Is used by the business community because the value it delivers is tangible. If the data warehouse does not deliver quality information with integrity that adds value to the business, then it will not be used.

• Performs as defined by the users in any agreements outlined early in development.

• Belongs to the users and not the IT department.

Appendix A: Practice Solutions

Practice 2-1

Answer the following questions.

1 OLTP databases hold up-to-the-minute information and are most commonly designed as read-only databases.

True

False

The correct answer is False because OLTP databases are not read-only databases.

2 In the scenario below, state whether it refers to an operational system or an analytical processing system.

“Show me how a specific brand of printer is selling throughout different parts of the United States and how this specific brand of printer is selling since it was first introduced into my stores.”

This scenario refers to:

a An operational system

b An analytical processing system

The correct answer is B because comparing sales between the different territories within the United States can provide a certain type of analytical information.

3 Who is the target audience for the data warehouse?

a The business community in the organization

b IT professionals

c Data-entry clerks

d None of the above

e All of the above

The correct answer is A because the main reason for having a data warehouse is to aid the business community in making better decisions.

4 Are the following statements true or false?

a Operational systems display the following qualities:

– Good performance: T

– Static data contents: F

– High availability: T

– Unpredictable CPU use: F

b Identify the reasons why business analysis is not easy with operational systems.

– Data is not structured for drill-down capability: T

– The system is not designed for querying: F

– Data analysis can be CPU-intensive: T

– Data is not integrated between systems: T

5 In groups of three or four, discuss the questions below and present your points to the class at the end of the discussion.

a List some of the reasons that your company is considering implementing a data warehouse or data mart.

b What are some of the business problems that your company is trying to answer?

c Why is the business community in your organization unable to find the answers to their business questions based on the existing information systems?

General Answers

Why data warehousing? According to Aaron Zornes, from the Meta Group, “IT organizations are under tremendous pressure to provide better quality decision-making information in forms easy to access and manipulate. Business users are reacting to their own mission-critical needs for better information due to rapidly changing, increasing volatile and competitive markets, as well as ever-shortening product life cycles.” Enterprises must become more competitive and get closer to their customers to survive. Some of the reasons why existing information systems are unable to provide the answers to business questions are:

– Much of the enterprise data is locked up in data “jailhouses”

– Operational systems are unable to provide a consolidated view of data

– Answering some of the business questions requires analyzing data patterns and trends over time. This often requires large volumes of historical data. Operational systems do not keep historical data, so this type of analysis cannot be done in an operational system.


Practice 3-1

1 Indicate which attributes belong to a data warehouse. Indicate whether the statements are true or false.

a Data is organized by time.

True. Data exists in the data warehouse specifically for analysis by time.

b Data is always stored in a relational database.

False. It is not imperative that the data be stored in a relational database, although it is more common.

c Data relates to business-specific areas.

True. The data warehouse may be enterprisewide but the way the data is organized within the database is by departmental need, subject need, and functional need.

d Data is sometimes integrated.

False. Data must always be cleaned and integrated into the warehouse.

e Data is replaced according to a refresh cycle.

False. Data is added to and not replaced.

f Data warehouses may contain any type of data.

True. If the database server supports any type of data, then the warehouse is capable of holding any type of data.

2 _______ is a set of rules or structures providing a framework for the overall design of a system or product.

a Technical infrastructure

b Data-access environment

c Architecture

The correct answer is C.

3 The ________ is closely related to the architecture and consists of the technologies, platforms, databases, gateways, and other components necessary to make the architecture functional within the corporation.

a Data access environment

b Technical infrastructure

c Data warehouse

The correct answer is B.

4 A telco company needs to understand their network traffic to better pinpoint frequent trouble spots and predict network expansion and usage. Storing call detail records and summarizing them by switch and trunk groups among other things in another environment will satisfy this need.

Which of the following are you going to design?

a Operational data store (ODS)

b Data warehouse

The correct answer is B because monitoring over a period of time is required.

5 An online bookstore has customers in their Sales Order System and in their Marketing System. These customers do not match between systems, because Marketing staff do not always update the Marketing System with current and complete customer data. The need here is for an integrated system that contains current customer data.

Which of the following are you going to design?

a Operational data store (ODS)

b Data warehouse

The correct answer is A because the organization needs current and integrated customer data.

6 Below are some of the benefits of data warehousing.

a Business decisions:

– Improves decision making process

– Provides basis for strategic planning efforts

– Improves business decisions (quality and quantity)

– Improves sales metrics

– Improves trend visibility

– Improves cost analysis

– Improves inventory and distribution channel management

– Improves monitoring of business initiatives

b Data access:

– Improves data availability and timeliness


– Improves data quality

– Improves data integration

– Improves access to historical information

– Provides easier data access

– Allows high performance data mining

– Allows access to data not previously available

– Improves data availability for customers

c Costs:

– Reduces staff

– Identifies lost revenue

– Optimizes space utilization

– Reduces inventory

– Reduces inventory replenish time

d Productivity:

– Provides access to data without programmer intervention

– Facilitates elimination of legacy system

– Reduces analysis efforts

– Reduces impact on operational systems

– Reduces manual analysis and data consolidation efforts


Practice 4-1

Interview Questions

Ask the key persons the following questions.

Possible responses from each of the candidates are shown below.

Role 2: CFO

1 What is the business vision?

– We are the market leader with a long tradition of dealing with drinks and beverages.

– We have survived by having a strong and focused management.

2 Why does the company need an enterprise data warehouse?

The board thinks that it is required to help maintain our competitive edge and market leading position.

3 What do you expect the data warehouse to provide, or what will you get out of the warehouse?

Directly nothing because our financial systems are fine, but it should keep the IT Director happy.

4 How soon do you need to have data loaded into the data warehouse and how up-to-date does the data need to be?

If we were to do this properly, we will need all the information in the warehouse up-to-date all the time.

Role 3: COO

1 What is the business vision?

– Need to reengineer our core processes to maintain our market position.

– Overall goal is to give my group better control over the business.

2 Why does the company need an enterprise data warehouse?

To integrate the information from our disparate legacy systems and new systems as they come online—this should allow us to quickly analyze any of the information we hold.

3 What do you expect the data warehouse to provide or what will you get out of the warehouse?

– Detailed customer information, such as who buys our products and where our products go, and tracking information in case there is a need to recall things.

– Let us see demographics of beverage types from around the world.

– Allow us to perform “what-if” analysis.


4 How soon do you need to have data loaded into the data warehouse and how up-to-date does the data need to be?

– Daily for our top 50 customers and a weekly update for the rest.

– We would probably also want to resegment our customers based on new transactions, for example, once per month.

Role 4: IT Director

1 What is the business vision?

To support the mission statement, we need bigger and better systems to enable us to become more competitive.

2 Why does the company need an enterprise data warehouse?

In the new and modern business world you need a warehouse. Our competitors have one and we must have one in order to compete with them.

3 What do you expect the data warehouse to provide, or what will you get out of the warehouse?

– Better information

– Better control of new products

– Take our disparate systems and help integrate them, which will bring real business benefit and control

4 How soon do you need to have data loaded into the data warehouse, and how up-to-date does the data need to be?

We will have daily loads for our top 50 customers, with a weekly catch-up for the rest.

Class Discussion

1 Identify the major challenges for a data warehousing implementation project, as shown in this exercise.

2 Give your suggestions on how to overcome these challenges.

3 If you apply the Oracle Data Warehouse Method to this project, how would you apply it, and where do you see the benefits of using this method?

General Answers

This exercise has been designed to get you thinking about some of the many issues that face any DSS implementation, regardless of size or complexity. The following sections outline some of the issues.

Political Issues

• Conflict between different parts of the business. In many businesses, very high barriers have been constructed between departments; thus the DSS can be considered to be a threat, because it will remove these barriers.


• Resistance to free and open information.

• General resistance to change. DSS implementations by their nature invoke an emotional reaction to change and so change management should be considered carefully. Avoid making statements such as “the system will help you make better decisions” because statements like this are emotionally charged.

• IT will tend to control the project. IT will see the problem as technical architecture and will therefore seek to own it.

• The business may see it as an IT project. This follows on from the last point. The business has to step up to the project. There are difficult decisions to make, such as what data to place in the system, how that data is defined, how long to keep it, and how to represent it. These are decisions that the business must take, and not IT.

Approach Issues

The approach to the project will have a significant impact on the overall success of the project. Some of the issues typically associated with a “bottom-up” approach include:

• The data warehouse may end up as a complex repository for operational data rather than one that can support the business decision making required. If this is the case, the business will inevitably lose faith in the system.

• The business will eventually lose faith in the data warehouse, and so it will become another piece of legacy.

• Failure to address data quality. A “bottom up” approach led by IT will typically avoid tough issues such as data quality, because IT typically lacks the influence to solve the problem. The solution lies with the business and not IT.

• Over- or underengineering of the solution will result, because it is difficult to hit a target when you don’t know what it looks like, especially if it is a moving target.

• If the solution is seen by the business as technology rather than as a business solution, they are unlikely to invest time and effort in it. We know that if the business linkage is not present, the solution is unlikely to succeed.

Sponsorship Issues

• Sponsorship is critical to a project’s success.

• Sponsorship must be effective—it is all well and good to have senior business sponsorship in the project, but this must be effective and active sponsorship, that is, involvement must be more than just attending regular meetings.

• The key sponsorship chain is linked to business rather than IT. This is largely because many of the more difficult, softer issues revolve around the business and therefore need a business pull rather than IT push to resolve.

• Communication to all stakeholders within the business is critical. The aims and aspirations for the project should be communicated as well as the progress of the project to assist in overcoming eventual resistance to change.


Business Vision Issues

The business must be clear about a number of factors:

• How the warehouse will add value to the business

• Why the warehouse will result in business change

• How business change will impact on the warehouse

You may have noticed that the above issues constitute a circular argument, which is important for everyone concerned to understand fully. If the warehouse is not going to change your business, why build one?

General Information Issues

Were the right questions asked and were honest answers always given?

• You will need to ask different questions of different parts of the business.

• You may not get the answers you need, because of a number of organizational and technical issues.

Because much of the information we need is both tacit and politically sensitive, you should not be afraid to ask follow-up questions.


Practice 5-1

1 There are no standard solutions to item 1, as the answers are subjective and unique to each student.

2 Similarly, there are no standard solutions to item 2, as the answers are subjective and unique to each student. The expectation is that students will utilize every strategy deliverable listed in the table, as each deliverable is considered essential for a successful warehouse implementation.

3 See answer to item 2.


Practice 6-1

1 Complete the user profile column in this exercise with one of the following user types:

– Executive

– Casual user or manager

– Business analyst or power user

Brian O’Reilly

– Access needs: needs to develop simple forecasts, such as budgets; ease of use is important

– Technology: Microsoft Office, Internet browser, spreadsheets, email

– User profile: Casual user or manager

Mary Ramos

– Access needs: one-click access; only needs highly summarized information; ease of use is very important

– Technology: email, Microsoft Office, Internet browser

– User profile: Executive

Kim Seng

– Access needs: constantly wants to “get more data”; understands the organization’s business processes

– Technology: spreadsheet, Oracle Reports, Oracle Discoverer, Oracle Express Analyzer

– User profile: Business analyst or power user

Amber Salinas

– Access needs: lots of drilling; customizes the graphical user interface (GUI); needs to know data structures

– Technology: extensive SQL programming; Oracle7X, Oracle8X Server; Oracle Express

– User profile: Business analyst or power user

2 Answer true or false to the following questions.

a Do not involve users in the early process of the data warehouse implementation because they are going to delay your delivery date.

False

b Choose the warehouse data access tools by involving only IT staff because they are the ones who know what the users need.

False

c Prototype access methods with prospective users.

True

3 Security Consideration exercise: There are no standard solutions to this question.


Practice 7-1

1 Identify whether the following statements are true or false.

– The business model is a logical representation of selected business processes. True

– The star model is normalized. False

– The snowflake model is denormalized. False

– All warehouses must have a time dimension. True

– In a warehouse environment, data loading performance is less important than query performance. True

2 Complete these sentences.

a Access to data in a _________ table is faster than calculating aggregates at the time of query execution.

The correct answer is summary.

b The data warehouse model contains ____ tables that comprise the measures of the business.

The correct answer is fact.

c Dimensions are denormalized in a _______ model.

The correct answer is star.

d A common guideline is to define granularity at one level ________ than currently used by end users.

The correct answer is lower.

3 There are no standard solutions to item 3, as the answers are subjective and unique to each student.
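To make the terms in this practice concrete, here is a small sketch of a star model with a fact table, two denormalized dimensions (including a time dimension), and a summary table. It uses Python's built-in sqlite3 module rather than an Oracle server, and every table name, column name, and value is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized dimension: category is stored on the product row itself.
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY,
                              product_name TEXT, category TEXT);
    CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY,
                           year INTEGER, month INTEGER);
    -- Fact table: holds the measures, keyed by the dimensions.
    CREATE TABLE sales_fact (product_id INTEGER, time_id INTEGER, amount REAL);
    INSERT INTO product_dim VALUES (1, 'Cola', 'Soft drinks'), (2, 'Lager', 'Beer');
    INSERT INTO time_dim VALUES (1, 1998, 11), (2, 1998, 12);
    INSERT INTO sales_fact VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 2, 70.0);
    -- Summary table: the aggregate is computed once, at load time.
    CREATE TABLE sales_by_category_year AS
        SELECT p.category, t.year, SUM(f.amount) AS total
        FROM sales_fact f
        JOIN product_dim p ON p.product_id = f.product_id
        JOIN time_dim t ON t.time_id = f.time_id
        GROUP BY p.category, t.year;
""")

# A query against the summary table needs no joins and no aggregation.
summary_rows = conn.execute(
    "SELECT category, year, total FROM sales_by_category_year ORDER BY category"
).fetchall()
```

The summary table trades load-time work and storage for a simpler, cheaper query. This is the sense in which loading performance matters less than query performance in a warehouse.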


Practice 8-1

1 Form into small groups, and consider each of the following hardware architectures. With your books closed, create a short definition for each architecture. Each answer should include the benefits and limitations of each architecture.

a Symmetric multiprocessing (SMP): Definition, benefits, and limitations. Please refer to pages 8-12 to 8-13.

b Non-uniform memory access (NUMA): Definition, benefits, and limitations. Please refer to pages 8-14 to 8-15.

c Clusters: Definition, benefits, and limitations. Please refer to pages 8-16 to 8-17.

d Massively parallel processing (MPP): Definition, benefits, and limitations. Please refer to pages 8-18 to 8-21.

2 Staying in your small group, discuss each of the following questions.

a What is parallelism?

It is the ability to perform functions in parallel.

b Why is it important to the data warehouse?

The appeal of parallel processing is especially strong for the data warehousing environment because of its emphasis on interactive processing of complex queries. Given this characteristic as well as the often extreme size of a warehouse, methods are clearly needed for more rapid query execution. By partitioning data among a set of processors, complex queries can be executed in parallel. This will potentially achieve linear speedup and thus significantly improve query response times.
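The idea of partitioning data among workers and combining partial results can be sketched in a few lines of Python. This is an illustration only: the sample rows and function names are assumptions, and Python threads show the partition, execute, combine pattern rather than the near-linear speedup that a parallel database server aims for.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_parts):
    """Split rows into n_parts slices, round-robin."""
    return [rows[i::n_parts] for i in range(n_parts)]

def partial_sum(rows):
    """The per-partition piece of the query: sum the sales amounts."""
    return sum(amount for _region, amount in rows)

def parallel_total(rows, n_workers=4):
    """Run partial_sum on every partition concurrently, then combine."""
    parts = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, parts))

# Hypothetical (region, amount) rows standing in for a large fact table.
sales = [("east", 10), ("west", 20), ("east", 30), ("north", 40), ("west", 50)]
total = parallel_total(sales)
```

The combined result matches a serial scan of the same rows; only the way the work is divided changes.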


Practice 9-1

1 For each of the following descriptions, state the type of partitioning method it best describes. The partitioning methods are range partitioning, hash partitioning, and composite partitioning.

– Places specific ranges of table entries on different disks. For example, records having “name” as a key may have names beginning with A-B in one partition, C-D in the next, and so on. Likewise, a DSS managing monthly operations might partition each month onto a different set of disks.

Partitioning method: Range

– Distributes DBMS data evenly across the set of disk spindles. This partitioning method is applied to one or more database keys, and the records are distributed across disk subsystems accordingly.

Partitioning method: Hash

– The drawback of this partitioning method is that the quantity of data may vary significantly from one partition to another and the frequency of data access may vary as well. For example, as the data accumulates, it may turn out that a larger number of customer names fall into the M-N range than the A-B range.

Partitioning method: Range

– This partition method is a combination of two partitioning methods. A table that is partitioned using this method is initially partitioned by range, and then subpartitioned using the hash method.

Partitioning method: Composite
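The three partitioning methods above can be sketched as functions that decide where a row is placed. This is a simplified illustration, not how a database server implements partitioning; the function names, the partition labels, and the choice of CRC-32 as the hash are all assumptions.

```python
import zlib

def range_partition(month):
    """Range partitioning: each month maps to its own partition."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return f"m{month:02d}"

def hash_partition(key, n_partitions=4):
    """Hash partitioning: spread keys across partitions using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions

def composite_partition(month, key, n_subpartitions=4):
    """Composite partitioning: range first, then hash within the range."""
    return (range_partition(month), hash_partition(key, n_subpartitions))
```

With range partitioning a popular range (a busy month, or names in M-N) produces a larger partition. The hash step spreads rows by key value instead, which is why the composite method applies it inside each range.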


2 For each of the following descriptions, state the type of indexing method it best describes. The indexing methods are B-tree, bitmap, and index-organized tables.

– Contains a hierarchy of highest-level and succeeding lower-level index blocks. The upper level blocks are called branch blocks and they point to the lower-level blocks. The leaf blocks are the lower-level blocks and they contain the unique ROWID that points at the location of the actual row.

Indexing method: B-tree

– This indexing method will benefit queries in which the WHERE clause contains multiple predicates on low-cardinality columns.

Indexing method: Bitmap

– Each row has a bit for each key. Each key value has a bit for each row.

Table Row ID   Male   Female
0001           1      0
0002           0      1
0003           0      1
0004           1      0

Indexing method: Bitmap

– This method merges table data and index data into one structure. Thus, the data is the index and the index is the data.

Indexing method: Index-organized table

3 Form into small groups, and consider each of the following questions. For each question, discuss in your groups and present your group’s answers to the class at the end of the discussion.

a How does RAID-5 differ from RAID-1?

RAID-1 (mirroring) is a strategy that aims to prevent downtime due to loss of a disk. Whereas RAID-5 in effect divides a file into chunks and places each on a separate disk, RAID-1 maintains a copy of the contents of a disk on another disk, referred to as a mirrored disk. Writes to a mirrored disk may be a little slower because more than one physical disk is involved, but reads should be faster because there is a choice of disks (and hence head positions) from which to seek to the required location.

b How do I decide between RAID-5 and RAID-1?

RAID-1 is indicated for systems where complete redundancy of data is considered essential and disk space is not an issue. RAID-1 may not be practicable if disk space is not plentiful. On a system where uptime must be maximized, Oracle recommends mirroring at least the control files, and preferably also the redo log files.

RAID-5 is indicated in situations where avoiding downtime because of disk problems is important, or when better read performance is needed and mirroring is not in use.

c What variables can affect the performance of a RAID-5 device?

The major ones are access speed of constituent disks; capacity of internal and external buses; number of buses; size of caches; number of caches; and nature of the algorithms used for determining how reads and writes are done.

d What types of files are suitable for placement on RAID-5 devices?

Placement of data files on RAID-5 devices is likely to give the best performance benefits, because these are usually accessed randomly. More benefit will be seen in situations where reads predominate over writes. Rollback segments and redo logs are accessed sequentially (usually for writes) and therefore are not suitable candidates for being placed on a RAID-5 device. Also, data files belonging to temporary tablespaces are not suitable for placement on a RAID-5 device.
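The difference between mirroring and striping can be illustrated with a toy model. The answers above do not describe how RAID-5 survives a lost disk; the sketch assumes the usual parity scheme, in which each stripe carries one extra chunk that is the XOR of its data chunks. Function names and sample data are hypothetical, and a real array also rotates the parity chunk across disks, which is omitted here.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def raid1_write(block):
    """RAID-1: keep a full copy of the block on a second (mirror) disk."""
    return [block, block]

def raid5_write(chunks):
    """RAID-5 style stripe: the data chunks plus one parity chunk."""
    return list(chunks) + [xor_blocks(chunks)]

def raid5_rebuild(stripe, lost_index):
    """Recover the chunk at lost_index by XOR-ing the surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

# Three data chunks on three disks, parity on a fourth.
stripe = raid5_write([b"AAAA", b"BBBB", b"CCCC"])
recovered = raid5_rebuild(stripe, lost_index=1)
```

Mirroring doubles the space used for every block, whereas the parity stripe adds only one extra chunk per stripe. The price is that each write must also update parity, which is one reason write-heavy, sequentially accessed files are poor candidates for RAID-5.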

4 For each of the descriptions below, assign the RAID level: RAID 0, RAID 1, or RAID 5.

– This RAID level has the lowest cost and highest performance.

RAID level: 0

– This RAID level is low cost and has high availability.

RAID level: 5

– This RAID level has high performance and high availability.

RAID level: 1
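Returning to the bitmap indexing method in item 2, the following sketch shows why it suits WHERE clauses with several predicates on low-cardinality columns. Each key value gets one bitmap with a bit per row, and combining predicates is a bitwise operation. The gender column follows the Male/Female figure in item 2; the region column and all function names are hypothetical.

```python
def build_bitmap_index(values):
    """Return {key value: integer bitmap}; bit i is set if row i has that value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into a list of row positions."""
    return [row for row in range(bitmap.bit_length()) if bitmap >> row & 1]

# Rows 0001-0004 of the figure, stored here as positions 0-3.
gender = build_bitmap_index(["Male", "Female", "Female", "Male"])
# A second, hypothetical low-cardinality column for the same four rows.
region = build_bitmap_index(["East", "East", "West", "West"])

# WHERE gender = 'Female' AND region = 'East' becomes one bitwise AND.
hits = matching_rows(gender["Female"] & region["East"])
```

Only the row at position 1 (row 0002 in the figure) satisfies both predicates, and it is found without visiting the table rows themselves.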


Practice 10-1

Please answer the following questions.

1 The acronym ETT stands for _________________________________________.

The correct answer is extraction, transformation, and transportation.

2 Name at least four potential sources of production data for the warehouse.

_____________________

_____________________

_____________________

_____________________

Correct answers include production operational systems; archives; internal files not directly associated with company operational systems, such as individual spreadsheets and workbooks; external data from outside the company.

3 Name at least five potential sources of external data for the warehouse.

___________________________________________

___________________________________________

___________________________________________

___________________________________________

___________________________________________

Correct answers include periodicals and reports; external syndicated data feeds; competitive analysis information; newspapers; purchased marketing, competitive, and customer related data; and free data from the Web.


4 Identify whether the following statements are true or false.

Question (True or False)

Archive data is never used in a data warehouse; it is too old.
Explanation: Archive data is particularly useful for the first time load, to include historical data.

X

External data is one of the easiest types of data to incorporate into the warehouse.
Explanation: External data is difficult to incorporate, as it varies in frequency, grain, and predictability.

X

It is impractical to eliminate data anomalies after the pilot run.
Explanation: Never leave data cleanup this late.

X

Mapping data is a process whereby you eliminate data inconsistencies.
Explanation: Mapping identifies source data attributes, identifies where they are to reside in the warehouse, and identifies what transformations are needed.

X

Gateways are great mechanisms for transferring large volumes of data into the warehouse.
Explanation: Gateways are only useful for smaller amounts of data.

X

Extraction tools are expensive.

X

Transforming data occurs only in the staging area.
Explanation: It may take place at other points, though the staging area is most common.

X


Practice 11-1

1 Dirty data must be eliminated for the data warehouse. Name three alternative and common terms used to describe the process of eliminating anomalies in data.

_____________________

_____________________

_____________________

The correct answer is cleaning, cleansing, and scrubbing.

2 Name at least five problems associated with source data that must be eliminated for the data warehouse.

___________________________________________

___________________________________________

___________________________________________

___________________________________________

___________________________________________

The correct answer is multipart keys, multiple encoding, multiple local standards, multiple files, missing values, element names, element meaning, input formats, duplicate values, referential integrity, names and addresses.

3 Identify whether the following statements are true or false.

Question (True or False)

It is considered impractical to eliminate data anomalies after the pilot run.
Explanation: Never leave data cleanup this late.

X

You need to consider adding time keys to warehouse data.
Explanation: All records must contain a time element contained in a key column.

X

Data transformation occurs only in the staging area.
Explanation: It may take place at other points, though the staging area is most common.

X


Practice 12-1

1 Assemble into small groups of 3 or 4. Discuss and compare the factors that will determine the load window where you work. Consider user requirements, operational constraints, and staffing issues.

There is no single correct answer.

2 Identify whether the following statements are true or false.

Question (True or False)

Transportation of data involves moving the data into the data warehouse database.
Explanation: Strictly, transportation involves moving and loading the data.

X

The data refresh cycle is determined by information technology groups.
Explanation: The cycle is determined by users.

X

The load window is the time that the IT group has dictated the data warehouse is available to the users for access.
Explanation: The load window is the time available to perform all ETT tasks.

X

An example of high-level grain data is summarized data.

X

Fact data frequently changes.
Explanation: Fact data is frequently added to at every refresh.

X

Dimension data infrequently changes.
Explanation: Dimension data changes but not as frequently as fact data is refreshed.

X

SQL*Loader is the fastest way to move data into the data warehouse database.

X

Gateways are useful for moving large amounts of data into the warehouse.
Explanation: Gateways are recommended only for small amounts of data.

X


3 Name the two different types of data loading.

_____________________

_____________________

The correct answer is first time load and refresh.

4 Name four methods of moving data to the warehouse server.

_____________________

_____________________

_____________________

_____________________

The correct answer is that there are five listed ways, and you may choose a hybrid of any of these.

– Wholesale data replacement

– Comparison of database instances

– Time and date stamping

– Database triggers

– Database log

5 What SQL command is used to create summary tables on the data warehouse server?

The correct answer is CREATE TABLE AS SELECT (CTAS), or CREATE TABLE AS SELECT... PARALLEL (pCTAS).
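The CTAS answer can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than an Oracle server, so the PARALLEL clause is not shown, and the table names, column names, and figures (sales_fact, sales_by_month_region) are hypothetical.

```python
import sqlite3

# Hypothetical detail (fact) table; names and figures are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_month TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("1999-01", "East", 100.0),
        ("1999-01", "East", 50.0),
        ("1999-01", "West", 70.0),
        ("1999-02", "East", 30.0),
    ],
)

# CREATE TABLE ... AS SELECT builds and populates the summary table in one statement.
conn.execute(
    """
    CREATE TABLE sales_by_month_region AS
    SELECT sale_month, region, SUM(amount) AS total_amount, COUNT(*) AS sale_count
    FROM sales_fact
    GROUP BY sale_month, region
    """
)

summary = conn.execute(
    "SELECT sale_month, region, total_amount, sale_count "
    "FROM sales_by_month_region ORDER BY sale_month, region"
).fetchall()
print(summary)
```

Queries that need monthly totals by region can then read the small summary table instead of re-aggregating the detail rows.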

Question (True or False)

Data for the data warehouse is always indexed after it is loaded.
Explanation: Indexing after the load is recommended, but the data is not always indexed afterward.

X

The quickest way to create unique indexes on warehouse data is to leave database constraints enabled on load.
Explanation: The fastest way is to disable constraints and then enable them after the data is loaded.

X

Summary tables are created on the warehouse server.

X

Filtering removes unwanted records from staging files.
Explanation: Filtering extracts data from the warehouse into data marts.

X


Practice 13-1

1 Identify whether the following statements are true or false.

2 Name four different techniques for capturing the changes to operational data that is to be loaded into the warehouse.

_____________________

_____________________

_____________________

_____________________

The correct answer is that there are five listed ways, and you may choose a hybrid of any of these.

– Wholesale data replacement

– Comparison of database instances

– Time and date stamping

– Database triggers

– Database log
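Of the techniques listed, time and date stamping is the simplest to sketch. The example below assumes each source row carries a last_modified stamp and that the time of the previous extract is known; the function name, field names, and data are hypothetical.

```python
from datetime import datetime


def changed_since(rows, last_extract):
    """Return the rows whose last_modified stamp is later than the previous extract."""
    return [row for row in rows if row["last_modified"] > last_extract]


# Hypothetical operational rows, each stamped when it was last changed.
source_rows = [
    {"id": 1, "name": "Ada", "last_modified": datetime(1999, 5, 1, 9, 0)},
    {"id": 2, "name": "Ben", "last_modified": datetime(1999, 5, 2, 14, 30)},
    {"id": 3, "name": "Cy", "last_modified": datetime(1999, 5, 3, 8, 15)},
]

last_extract = datetime(1999, 5, 2, 0, 0)
delta = changed_since(source_rows, last_extract)
changed_ids = [row["id"] for row in delta]
print(changed_ids)
```

Only the rows in delta need to be carried into the refresh; rows stamped before the previous extract are skipped.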

3 Answer the following questions about updating dimension data.

a What method of updating dimension data would you employ if you wanted to keep old and new records?

The correct answer is keep history.

b What relationship would that map to in an entity relationship model?

The correct answer is a one-to-many relationship.
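A minimal sketch of the keep-history approach, under stated assumptions: each dimension row carries a warehouse key plus valid_from and valid_to dates, and a change closes the current row and appends a new one. The function, column names, and data are hypothetical and are not taken from the course material.

```python
from datetime import date


def keep_history(dimension_rows, customer_id, new_attributes, effective):
    """Close the current version of a customer and append a new version.

    The old record is kept, so one customer maps to many dimension rows.
    """
    for row in dimension_rows:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            row["valid_to"] = effective  # old record stays, marked as superseded
    next_key = max((row["customer_key"] for row in dimension_rows), default=0) + 1
    dimension_rows.append(
        {
            "customer_key": next_key,  # new warehouse key for the new version
            "customer_id": customer_id,
            **new_attributes,
            "valid_from": effective,
            "valid_to": None,
        }
    )
    return dimension_rows


# Hypothetical customer dimension with one current row.
customer_dim = [
    {
        "customer_key": 1,
        "customer_id": 42,
        "city": "Chicago",
        "valid_from": date(1998, 1, 1),
        "valid_to": None,
    }
]

keep_history(customer_dim, 42, {"city": "Boston"}, date(1999, 5, 1))
versions = [row for row in customer_dim if row["customer_id"] == 42]
print(versions)
```

After the update, customer 42 has two rows in the dimension, which is the one-to-many relationship the answer refers to.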

4 What server technique can be used to prevent and allow access to data in the warehouse after refresh?

The correct answer is the ROLES command.

Question (True or False)

a The data refresh cycle is determined by information technology groups.
Explanation: The cycle is determined by users.

X

b Fact data frequently changes.
Explanation: Fact data is frequently added to at every refresh.

X

c Dimension data infrequently changes.
Explanation: Dimension data changes but not as frequently as fact data is refreshed.

X


Practice 14-1

1 Give one example of where metadata exists in an operational environment.

________________________________________________________

The correct answer is the database server data dictionary.

2 Why is metadata important to the following people?

a Users who are accessing the data warehouse

________________________________________________________

________________________________________________________

The correct answer is that it provides them with information about the data they are accessing, and shows them the meaning of data, context, summary levels, ownership, and many more attributes.

b IT staff developing ETT routines

________________________________________________________

________________________________________________________

The correct answer is that it contains all source data information, transformation routines, mapping, structure, and meaning of data.

3 Name two techniques you might employ to create metadata.

________________________________________________________

________________________________________________________

The correct answer is any two from this list: data modeling tools, data dictionary, ETT tools, end user tools, COBOL copybooks, middleware tools.

4 Name two roles within the data warehouse development team who have responsibility for metadata.

________________________________________________________

________________________________________________________

The correct answer is metadata architect and metadata manager.

5 What is the issue with integration and metadata?

________________________________________________________

________________________________________________________

________________________________________________________

The correct answer is that many tools have their own metadata layers, which must be integrated for the environment.


6 What is important about the context of data?

________________________________________________________

________________________________________________________

The correct answer is that it allows the historical perspective of data to be constantly available.

7 Name the Oracle tool you can use to develop metadata.

________________________________________________________

The correct answer is Oracle Designer, Data Mart Suite, or OADW. Oracle Warehouse Builder will also support metadata management.


Practice 15-1

1 In the following scenarios, choose the type of analysis that most accurately defines the scenario. The types of analysis from which you may choose are:

– Query and reporting

– Multidimensional/OLAP

– Data mining

– Drill-down and pivot

– Calculations and derived data

– Spreadsheet

– Modeling, time-series and financial

– What if

Scenario / Type of Analysis

a. Show start date and salary grade for all employees reporting to Clare Maury.
Type of analysis: Query and reporting

b. Highlight all orders above $30,000.00.
• Drill from product totals to individual orders
• Look at a copy of the invoice
Type of analysis: Drill-down and pivot

c. Show product sales in each region as a percentage of the total sales in that region.
Type of analysis: Calculations and derived data

d. Did the $2 million promotion increase sales?
Type of analysis: Modeling, time-series and financial

e. How many people to hire, when to hire them, and where to locate them.
Type of analysis: Modeling, time-series and financial

f. If we lowered prices, would our overall revenue increase?
Type of analysis: What-if

g. Find me the relationship between X and Y.
Type of analysis: Data mining

h. Show me all the products that are currently back-ordered.
Type of analysis: Query and reporting

i. What is the 13 week moving average of sales?
Type of analysis: Calculations and derived data

j. Projecting costs and allocating overhead based on head count, sales forecasts, and consumer price index (CPI).
Type of analysis: Modeling, time-series and financial

2 For the following phrases and sentences, determine which category each of them belongs to. You may choose from the following list.

• Data

• Information

• Knowledge

• Decision

Description / Category

Mary lives in Belmont Shores, California.
Category: Information

Point of sale (POS)
Category: Data

AppleTree juice is bought 45% of the time that Crystal Geyser juice is bought.
Category: Knowledge

Let us promote Crystal Geyser juice on the East Coast of the United States in stores.
Category: Decision

Demographic
Category: Data

Customers of the upper middle class will use 10% of their annual income during the Christmas holiday season.
Category: Knowledge

3 The diagram below illustrates an example of data mining. The technique that it uses is called _________________.

[Diagram labels: Age, Region, Call Rate, Service, Lost, Loyal]

The correct answer is artificial neural network.

4 The description below describes a data mining technique. What is the technique used?

1. If the vehicle has a 2-door frame AND

2. If the vehicle has at least six cylinders AND

3. If the buyer is less than 40 years old AND

4. If the cost of the vehicle is > $35,000 AND

5. If the vehicle color is red, THEN

6. The buyer is likely to be male.

The correct answer is decision tree.
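The vehicle rule in question 4 corresponds to a single root-to-leaf path through a decision tree. As a rough sketch, that path can be written as one predicate; the function and parameter names are hypothetical, and a real data mining tool would derive such rules from data rather than have them hand-coded.

```python
def likely_male_buyer(doors, cylinders, buyer_age, cost, color):
    """Apply the five conditions of the rule; all must hold for the conclusion."""
    return (
        doors == 2              # 1. 2-door frame
        and cylinders >= 6      # 2. at least six cylinders
        and buyer_age < 40      # 3. buyer less than 40 years old
        and cost > 35000        # 4. cost greater than $35,000
        and color == "red"      # 5. vehicle color is red
    )


print(likely_male_buyer(2, 6, 35, 40000, "red"))
print(likely_male_buyer(4, 6, 35, 40000, "red"))
```

Each condition corresponds to one branch test on the path; a record that fails any test follows a different branch and the rule draws no conclusion about it.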


Practice 16-1

Web-Based Tools Requirement Checklist

There are no standard solutions to this question.


Glossary


A

Access The process of accessing the data warehouse database objects containing data using tools that perform analysis, standard queries, provide statistical information, and mine data. See OLAP, Data Mining, Data Access.

Additive Measurements in a fact table that can be added across all the dimensions. See Dimension.

Ad hoc One time only, casual, nonplanned access to the database. See Access, Data access.

Aggregated data Precalculated and pre-stored summary data that is held in tables in the data warehouse. Aggregated data provides direct access to calculated data that improves query performance. Functions used to calculate aggregated data include SUM, MAX, MIN, COUNT, and AVG. See Summary Tables.

Aggregated facts See Aggregated data, Summary tables.

Application Program Interface A set of calling conventions that allow application programs to access computing services. APIs present application developers with a published interface to computing services that can be used with other facilities to provide a single-system image across a heterogeneous network of processors.

Atomic data The data at its lowest level of detail that provides the base data for all data transformations.

Atomic value A data value that cannot be further decomposed.

Attribute Any detail that serves to qualify, identify, classify, quantify, or express the state of an entity.

B

Backup and recovery strategy A storage and recovery strategy that protects against business information loss resulting from hardware, software, or network faults.

BAP See Business Alliance Program.

Batch A computer environment that processes an action or user request without user interaction. Some batch programs work in the background, allowing simultaneous user access.

Bitmap index A specialized form of index indicating the existence or nonexistence of a record by a series of ones and zeros. Prevalent with the Oracle7 and Oracle8 database servers.
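The idea behind a bitmap index can be sketched in a few lines. This is only an illustration of the ones-and-zeros concept, not a description of how the Oracle server stores bitmap indexes; the function names and the sample column are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose bit i is 1 when row i holds that value."""
    index = {}
    for row_number, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


def matching_rows(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    rows = []
    row_number = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_number)
        bitmap >>= 1
        row_number += 1
    return rows


# Hypothetical low-cardinality column, one entry per table row.
regions = ["East", "West", "East", "North", "West"]
index = build_bitmap_index(regions)

# A condition such as region = 'East' OR region = 'North' becomes a bitwise OR.
east_or_north = index["East"] | index["North"]
print(matching_rows(east_or_north))
```

Because conditions combine with cheap bitwise operations, this kind of index suits columns with few distinct values, which are common in warehouse dimension attributes.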

Bitmapped interface See graphical user interface.

Business An enterprise, commercial entity, or firm in either the private or public sector, concerned with providing products or services to satisfy customer requirements.

Business area The set of business processes within the scope of a data warehouse project.

Business Alliance Program (BAP) An Oracle initiative that invites vendors to offer products and services that are complementary to those offered by Oracle.


Business metadata The information provided to users that allows them to understand and access warehouse data. It focuses on what data is in the warehouse, how it was transformed, the source, and the timeliness of the data. See User Metadata.

Business rule A rule under which an organization operates. Business rules are applied to data using constraints.

C

C A third generation programming language.

C++ A third generation programming language.

Cache A temporary storage area in computer memory.

Cardinality The number of rows in a table. See Table, Column, and Row.

CASE See Computer-aided systems engineering.

Checkpoint A database server event which at a point in time writes all modified database buffers in the system global area to the data files. The process controlling this action is called the Database Writer (DBWR).

Cleaning See Cleansing.

Cleansing The process of transforming the operational and external source data into a defined and standardized format using packaged software applications or programs, prior to moving that data into the warehouse. Also referred to as data cleaning, data cleansing, or scrubbing. See Source data.

Client-server A technical architecture that links many personal computers or workstations (clients) to one or more large processors (CPUs or servers). The architecture enables the separation of local client processing from the server that manages the databases, access, and data integrity. The architecture allows for optimal performance at both the client and the server sides.

Cluster A means of sorting and storing related data from different tables in the database, on cluster keys. Advantageous in an environment where related data is commonly queried together.

COBOL A third generation programming language.

Column A means of implementing an item of data within a table. See Table, Row, Attribute.

Composite key A key in a database table that is made up of a number of (column or field) values.

Compound key See Composite key.

Computer-aided systems engineering (CASE) The combination of graphical, dictionary, generator, project management, and other software tools to help computer development staff engineer and maintain high-quality systems.

Concatenated key See Composite key.

Concatenated index An index that is created on a composite key. See Composite key.

Constellation model A warehouse model that comprises a collection of star models. See Star model, Snowflake model.


Constraint 1. The part of the WHERE clause in an SQL SELECT statement that identifies the column or field value that qualifies the query. 2. Any external, management, or other factor that restricts a business or a systems development in terms of resources, availability, dependencies, timescales or some other factor. See Business rule.

CORBA Common Object Request Broker Architecture

Corporate data model A model of the business needs and data requirements for an online transaction processing system.

Cost based optimizer A statistical mechanism that analyzes where and how to retrieve data from the Oracle7, Oracle8, and Oracle8i servers to ensure fast access to data.

Cube A commonly used name for a dimensional database where values can be analyzed across a minimum of three dimensions.

D

DASD See Direct-access storage device.

Data access See Access.

Data acquisition The process of extracting, transforming, and transporting data from the source systems and external data sources to the data warehouse database objects. The term is synonymous with ETT, and is widely used within Data Warehouse Method. See ETT.

Data aggregation The process of redefining data into a summarization based on some rules or criteria. See Aggregated data, Aggregated facts, Summary tables.

Data Definition Language (DDL) SQL statements that create, modify, and remove database objects such as tables, indexes, and users. Common DDL statements are CREATE, ALTER, and DROP. See DDL.

Data extract A subset of data extracted from one environment and transported to another environment. See Extract processing.

Data integrity The quality of the data residing in the database objects. Constraints on the database tables enforce integrity rules.

Data Manipulation Language (DML) SQL statements that query and amend the database data. Common DML statements are SELECT, INSERT, UPDATE, and DELETE. See DML.

Data mart A data warehouse data class organized for a business functional area or department. The database contains data summarized at multiple levels of granularity and may be designed using relational or multidimensional database structures.

Data migration tools Unspecified tools that allow data to be moved from the various sources into the data warehouse.

Data mining A technique that discovers previously unknown patterns and relationships in data. Data mining queries may take a long time to execute.

Data warehouse An enterprise-structured repository of subject-oriented, time variant, integrated, historical data used for information retrieval. The very large data warehouse database stores atomic and summary level data. The data warehouse provides the source data for data marts within the enterprise.


Data Warehouse Method (DWM) A structured method for full life-cycle custom development data warehouse projects. It is based on the Custom Development Method. See Custom Development Method.

Database A collection of data, usually in the form of tables or files, under the control of a database management system. See Database management system.

Database administrator A person within the information technology (or information systems) organization who is responsible for administering, monitoring, and maintaining the database.

Database management system The component of a database that controls all user and system activities related to the core functions of the database, such as security checking, tablespace allocation, and space management.

Data model A representation of the specific information requirements of a business area. See Entity relationship diagram.

Data source See Source.

DBA See Database administrator

DBMS See Database management system, Relational database management system.

DDL See Data Definition Language.

Decision support The act of using data and tools within an organization to support managerial decisions. Usually decision support involves the analysis of many units of data in a heuristic fashion. As a rule, decision support processing does not involve updating data. See Heuristic.

Decision support systems (DSS) An application used to provide summary or consolidated data to users for analysis, planning, and performing what-if analysis by using specialized tools that are usually driven by a GUI. See Graphical user interface.

Delta A file created by an application that contains only changes made to the application.

Denormalization A database design function that restructures a database by introducing derived data, replicated data, and repeating data. The technique is often employed to enhance performance within decision support and data warehouse environments. See Data warehouse, Decision support systems.

Denormalized data The data within a denormalized database model. See Denormalization.

Dependent data mart A data mart that is sourced directly from an existing data warehouse. See Data mart, Independent data mart.

Derived column A value derived by some algorithm from the values of other columns. See Derived data.

Derived data Data that exists only as a subset of other data. Also called Derived attribute.

Designer/2000 The Oracle computer-aided systems engineering (CASE) tool.

Detail data See Fact data.


Developer/2000 The Oracle application building tool for query, reporting, database manipulation, and graphical display of database values.

Dimension A construct within a multidimensional structure that represents a side of a multidimensional cube. Each dimension represents a different category that the business chooses to measure by, such as customer, region, product, and time.

Dimension data The data by which the user queries the business measurables. Contained in dimension tables. See Fact data, Fact tables, Dimension table, Dimension model.

Dimension table A table in a star model that is joined to the fact table by a key value.

Dimensional model A model that supports a top-down design methodology. For each business process, it determines relevant facts and dimensions.

Direct-access storage device (DASD) A data storage unit where data can be accessed directly without having to progress through a serial file such as magnetic tape.

Dirty data Data that is in an unfit state to be loaded into the data warehouse. It must be transformed first. See Transformation, Cleaning.

Discoverer The Oracle end-user analysis, query, and reporting tool that is particularly good for use in the data warehousing environment.

Discrete Usually used with reference to dimension attributes. Data, usually text, that takes on a fixed set of values that rarely change.

DML See Data Manipulation Language.

Drill-across A technique that queries data from two or more fact tables in a single report.

Drill-down An analytical technique that queries data from a summary row and navigates through a hierarchy of data to reach the detail-level rows.

Drill-up An analytical technique that navigates from detail to header rows of data. Used to view summarized (or aggregated) data.
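Drill-down and drill-up move in opposite directions along the same hierarchy. A minimal sketch, assuming a small list of hypothetical order rows with region, product, and amount fields (none of these names come from the course material):

```python
from collections import defaultdict

# Hypothetical detail rows; names and figures are illustrative only.
orders = [
    {"region": "East", "product": "Juice", "amount": 120},
    {"region": "East", "product": "Soda", "amount": 80},
    {"region": "West", "product": "Juice", "amount": 50},
]


def drill_up(rows, level):
    """Aggregate detail rows into one summary total per member of the given level."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row["amount"]
    return dict(totals)


def drill_down(rows, level, member):
    """Return the detail rows that sit behind one summary row."""
    return [row for row in rows if row[level] == member]


region_totals = drill_up(orders, "region")
east_detail = drill_down(orders, "region", "East")
print(region_totals)
print(east_detail)
```

The summary row for East is the sum of exactly the detail rows that drill_down returns for East, which is what lets a user move between the two views.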

DWM See Data Warehouse Method.

E

End User Layer (EUL) The user interface and layout of multidimensional structures designed for the data access tools. This includes customization of the tools for end users.

Enterprise A group of departments, divisions, groups, or companies that make up a business. See Business.

Enterprise Manager An Oracle product that gives a GUI front end to systems and databases for enterprise wide systems management.

Enterprise model A neutral model of the business.

Entity relationship diagram (ERD) A diagram that pictorially represents entities, the relationships between them and the attributes used to describe them.


Entity relationship model (ERM) A type of data model. Part of the business model that consists of many entity relationship diagrams. See Entity relationship diagram.

ETT An acronym that stands for extraction, transformation, and transportation. It refers to the methods involved in cleaning operational data and moving it from source systems into the warehouse.

EUL See End-user layer.

Express The generic name of a suite of Oracle products that enable users to analyze multidimensional data and perform complex analysis for decision support.

External data Data originating from a non-operational source or outside the central processing complex, such as magazines, newspapers, and financial companies.

Extract processing The process of selecting data from one environment and transporting it to another environment for use by individual users or departments.

Extraction The process of selecting and pulling data from the operational and external data sources, in order to prepare it for the warehouse. Also called data extraction.

Extraction, transformation, and transportation See ETT.

F

Fact data The measurements, within the core of the data warehouse, on which all OLAP queries depend. See Online analytical processing, Fact table.

Fact table The core (central) table in a star or snowflake model, characterized by a composite key. Values in the composite key join to keys in the dimension tables. See Composite key, Dimension table, Detail data.
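A minimal sketch of a fact table whose composite key joins to two dimension tables, using Python's built-in sqlite3 module rather than an Oracle server. The table names, column names, and figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month TEXT);

    -- The fact table's composite key is made up of the dimension keys.
    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES product_dim (product_key),
        time_key INTEGER REFERENCES time_dim (time_key),
        amount REAL,
        PRIMARY KEY (product_key, time_key)
    );

    INSERT INTO product_dim VALUES (1, 'Juice'), (2, 'Soda');
    INSERT INTO time_dim VALUES (1, '1999-01'), (2, '1999-02');
    INSERT INTO sales_fact VALUES (1, 1, 100.0), (1, 2, 40.0), (2, 1, 25.0);
    """
)

# Each part of the composite key joins to the key of one dimension table.
rows = conn.execute(
    """
    SELECT p.product_name, t.month, f.amount
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim t ON t.time_key = f.time_key
    ORDER BY p.product_name, t.month
    """
).fetchall()
print(rows)
```

The measurements live only in the fact table; the dimension tables supply the descriptive values the user queries by.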

Feedback Response to requests, including corrections, additions, and approval elicited from users, sponsors, and any others with an interest in the data warehouse.

File Transfer Protocol (FTP) A method for transferring files from one location to another.

Foreign key A key data value (which may comprise one or more columns) in a relational database table that joins to a primary key on another table. See Primary key.

Forms See Oracle Forms.

FTP See File Transfer Protocol.

G

Gap analysis The process of determining and evaluating the variance between two items’ properties.

Gateway A technology that enables inter-server communication using various communication protocols.

Generalized key A dimension table primary key that is created by modifying an existing key. Generalized keys are also used with slowly changing dimensions and summary data.

Gigabyte One thousand million bytes.

Grain The level of detail of the data stored in the database or data warehouse or moved into the data warehouse from source systems.
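
A minimal sketch of what a change of grain looks like, assuming hypothetical daily sales rows (the data, field order, and function name are illustrative and not part of the course material). Rolling daily rows up to months gives a coarser grain: fewer rows, less detail.

```python
from collections import defaultdict

# Hypothetical daily-grain sales rows: (date, product, amount).
daily_sales = [
    ("1999-05-01", "widget", 10.0),
    ("1999-05-02", "widget", 15.0),
    ("1999-06-01", "widget", 7.0),
    ("1999-05-01", "gadget", 4.0),
]


def roll_up_to_month(rows):
    """Aggregate daily-grain rows to a coarser monthly grain."""
    totals = defaultdict(float)
    for date, product, amount in rows:
        month = date[:7]  # keep only the "YYYY-MM" part of the date
        totals[(month, product)] += amount
    return dict(totals)


monthly_sales = roll_up_to_month(daily_sales)
```

The monthly result can always be derived from the daily rows, but not the other way around, which is why the grain chosen at load time limits the questions the warehouse can answer later.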

Granularity See Grain.

Graphical user interface (GUI) A user interface that is driven by point-and-click operations using a mouse rather than a keyboard. Also known as a bitmapped interface.

H

Heuristic The process of learning by discovery.

Hierarchical database An older style of database where records are strictly related and access is strictly defined.

Householding In the financial services sector, assigning a customer account or individual to a collection of accounts, individuals, or locations for marketing purposes.

Hypercube A multidimensional model supporting more than three dimensions. You can visualize this model by considering a number of three-dimensional cubes that are related to one another.

Hypertext Markup Language (HTML) The language used to create HTML pages for the Web using a word processor or text editor.

Hypertext Transfer Protocol (HTTP) The first component, the protocol, of a URL address, used widely in the Internet and intranet environment. HTTP defines how to interpret information. Other common protocols you may come across include FTP, news, and gopher. See Uniform Resource Locator.

I

Implementation The installation of an increment of the data warehouse solution (hardware, software, documentation, training) that is complete, installed, tested, proved, operational, and ready to use.

Increment The defined scope of the portion of the data warehouse selected for implementation. Each increment satisfies elements of the total data warehouse solution.

Incremental development A technique for producing all or part of a production system based on an outline definition. The technique involves iterations of a cycle of build, refine, and review so that the correct solution emerges.

Independent data mart A data mart that is sourced directly from operational systems. See Data mart, Dependent data mart.

Index An area of the database storage dedicated to holding key data values to allow direct access to a database row.

Information requirement The detail and summary data and access functionality required to satisfy the users’ decision support and analysis functions for decision making and planning.

Initial load The first population (insert) of the production data warehouse database with data from source systems. This load often contains large amounts of historical data. See Load, Refresh cycle.

Integrate To take data from a variety of different sources, in different formats, and merge it into a single format.

Integrity rules The laws that govern the operations allowed on the data and structures of a database.

Internal data Data that resides within an organization’s central processing complex.

Iterative development The application of a cyclic, evolutionary approach to system development.

K

Knowledge worker A person whose job relies on information as a primary resource.

L

Legacy system An existing operational system that is used for entering data about the company’s operations.

Level fields These fields are often held in dimension tables and relate to summary data stored in the central fact table. Not a common approach to storing summary data.

Load The process of moving extracted, transformed data into the data warehouse. See Initial load, Refresh cycle.

Load window The time taken to load data from multiple source systems into the data warehouse. Can also be used to mean the time available for the data load.

Logical model The phase of database design that is concerned with identifying the relationships among the tables.

M

Mapping The process of matching data from source systems to the structures in the data warehouse.

Mapping tools Tools used to perform mapping.

Massively Parallel Processor (MPP) A shared-nothing architecture that takes a number of nodes and enables them to communicate rapidly.

Metadata Data that contains information about the data and structures in the data warehouse. Metadata is both for business users and technical users. See Business metadata and User metadata.

Metalayer An architectural component of the warehouse that resides between the warehouse data and the user, and contains metadata. See Metadata.

Middleware A layer that provides an easy-to-use, intuitive presentation of the underlying data or data structures.

MOLAP See Multidimensional online analytical processing.

Multidimensional analysis See Online analytical processing.

Multidimensional database A database management system where data can be viewed and manipulated in multiple dimensions. It provides a structure that supports specialized query techniques such as drill-down, consolidation, and slicing and dicing. See Cube.

Multidimensional online analytical processing (MOLAP) Data is stored and presented to the user over three or more dimensions.

N

Nonadditive A fact that cannot be logically added between records. May be numeric and must be combined in a computation with other facts before being added across records.
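
A small worked example of a nonadditive fact, using a hypothetical margin percentage (the figures and names below are illustrative assumptions). Summing the per-record percentages gives a meaningless number; the additive facts must be summed first and the ratio recomputed.

```python
# Hypothetical fact rows holding two additive facts.
rows = [
    {"revenue": 100.0, "cost": 80.0},   # margin 20%
    {"revenue": 300.0, "cost": 150.0},  # margin 50%
]


def margin_pct(revenue, cost):
    """Margin as a percentage of revenue: a nonadditive fact."""
    return (revenue - cost) * 100.0 / revenue


# Wrong: adding the percentages across records.
naive_sum = sum(margin_pct(r["revenue"], r["cost"]) for r in rows)

# Right: add the additive facts, then compute the ratio once.
total_revenue = sum(r["revenue"] for r in rows)
total_cost = sum(r["cost"] for r in rows)
overall_margin = margin_pct(total_revenue, total_cost)
```

Here the naive sum is 70, while the margin over both records together is 42.5 percent.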

Nonuniform memory access (NUMA) A method of accessing shared memory on systems that have loosely coupled memory. Oracle Parallel Server can work with this access method.

Normalization A technique that eliminates data redundancy. See Normalized data.

Normalized data Data that has been separated into groups linked by defining normal relationships, where all redundancy in the data and repeating groups of data are removed. The usual normalization level is called third normal form, represented as 3NF. See Normalization.

NULL The state of a data item that indicates no value.

NUMA See Nonuniform memory access.

O

ODS See Operational data store.

OLAP See Online analytical processing.

OLAP Server A multidimensional database that provides a data structure that enables flexible access to data and explores the relationship between summary and detail data.

OLTP See Online transaction processing system, Operational system.

Online analytical processing (OLAP) A loosely defined set of principles that provide a dimensional framework for decision support. Online analytical processing allows for analysis of data to reveal business trends and statistics that are not immediately visible in operational data. Also known as multidimensional analysis.

Online transaction processing system (OLTP) The process whereby day-to-day transactional data is held in a repository that contains the operational data for the business.

Operational data Data that is maintained and used for the day-to-day processing and functional requirements of the business.

Operational data store A repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may act simply as a staging area for data to be moved into the warehouse.

Operational system A system that supports day-to-day transactional information that supports the client’s business. See Online transaction processing system.

Oracle Expert An expert systems advisor that generates performance tuning recommendations based upon a global system view. Suggestions regarding space allocation, schema design, and indexing strategies help DBAs tune VLDB environments.

Oracle Forms An Oracle Developer/2000 tool for creating, maintaining, and running full-screen, interactive applications called forms. The forms enable users to see and change data in an Oracle database. They can be used in block mode, character mode, or bitmapped environments.

Oracle Method The methodology employed by Oracle for corporate system implementation. Incorporates the Data Warehouse Method and project management software.

Oracle Parallel Server See Parallel Processor, Oracle Server.

Oracle Reports The powerful, flexible Oracle Developer/2000 report-writing tool. Reports may be integrated with Oracle Forms or run stand-alone.

Oracle Server The Oracle relational database management system (RDBMS). Components of the Oracle server include the kernel and various utilities for use by database administrators and users. See Relational database management system, Server.

Oracle Trace A performance data management tool that collects, manages, and displays performance data from throughout the enterprise, including resource use (CPU, I/O, page faults) by user or component.

P

Parallel Processor The Oracle server component that splits a single database action into many processes. See Parallel Query Option.

Parallel Query Option The Oracle server option that splits a single database query request into a series of parallel query operations. See Parallel Processor.

Partitioned data Data that is physically divided across many hard disks. Data may be partitioned horizontally or vertically. The technique improves application performance and security. Also called Data partitioning.

Partitioning Splitting data across different units. Partitioning may be achieved at the system or application level.

Pilot An initial project that serves as a model or template for future projects.

Pivoting A query technique that enables the arrangement of rows and columns to be changed in a report.
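
A minimal sketch of pivoting, assuming a hypothetical list of sales records (the `pivot` helper and the field names are illustrative, not taken from any Oracle tool). The same data is arranged first with regions as rows and quarters as columns, then the other way around.

```python
def pivot(rows, row_key, col_key, value_key):
    """Arrange flat records into a nested {row: {column: value}} report."""
    table = {}
    for record in rows:
        table.setdefault(record[row_key], {})[record[col_key]] = record[value_key]
    return table


# Hypothetical query result: one record per region and quarter.
sales = [
    {"region": "East", "quarter": "Q1", "amount": 10},
    {"region": "East", "quarter": "Q2", "amount": 12},
    {"region": "West", "quarter": "Q1", "amount": 7},
    {"region": "West", "quarter": "Q2", "amount": 9},
]

by_region = pivot(sales, "region", "quarter", "amount")   # rows are regions
by_quarter = pivot(sales, "quarter", "region", "amount")  # rows are quarters
```

Only the arrangement changes between the two reports; every cell value is the same in both.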

PL/SQL See Procedural SQL.

Primary key A single or multiple column value that uniquely identifies a single row in a relational database table.

Procedural Gateway Middleware that enables data on a non-Oracle database to be viewed from Oracle applications. See Middleware, Transparent Gateway.

Procedural SQL An extension to Oracle SQL. It enables SQL to be embedded within third generation programming constructs such as GOTO and LOOP statements for finer programming control.

Process 1. A key element of Oracle Method. A cohesive set or thread of related tasks that meets a specific project objective. A process results in one or more key deliverables. 2. A sequential execution of functions triggered by one or more events. See Oracle Method, Data Warehouse Method (DWM).

Proof-of-concept An approach that contains a well-defined set of objectives and is scoped to demonstrate the immediate business benefit of an increment of the data warehouse. See Increment.

Q

Query Manager Middleware that presents the user querying data with an easy-to-use and clear picture of the underlying business data.

R

RDBMS See Relational database management system, Oracle Server.

Reach-through Used by online analytical processing tools to directly access data on a relational database server. The tool presents the data in a multidimensional manner.

Reference data Data held in reference tables. See Reference tables.

Reference tables Hold textual data that contain expanded descriptions of data resident in dimension tables.

Referential integrity A condition that guarantees that the values in one column also exist in another column. This guarantee is enforced through the use of integrity constraints.
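
A minimal sketch of an integrity constraint enforcing referential integrity. It uses Python's built-in `sqlite3` module rather than an Oracle server, and the `product` and `sale` tables are hypothetical. Note that SQLite checks foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sale ("
    " sale_id INTEGER PRIMARY KEY,"
    " product_id INTEGER NOT NULL REFERENCES product(product_id),"
    " amount REAL)"
)
conn.execute("INSERT INTO product VALUES (1, 'widget')")
conn.execute("INSERT INTO sale VALUES (100, 1, 9.99)")  # parent row exists, accepted

try:
    # product_id 2 has no matching row in product, so the constraint fails.
    conn.execute("INSERT INTO sale VALUES (101, 2, 5.00)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

sale_count = conn.execute("SELECT COUNT(*) FROM sale").fetchone()[0]
```

The second insert is refused, so every `product_id` stored in `sale` is guaranteed to exist in `product`.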

Refresh The process of updating the data warehouse database objects with new data. The refresh process occurs on a predefined and scheduled basis after initial load. See Initial load, Refresh cycle.

Refresh cycle The frequency by which data in the data warehouse database objects is updated with new data. The cycle is determined by user business requirements. Also refers to the regular process of updating the data warehouse with further fact (detail) data and creating appropriate summary tables and data indexes.

Relational database management system (RDBMS) Software that creates and maintains the database system, as well as the data stored in the database (in Oracle terms, Version 6 and earlier). See Server.

Relational online analytical processing (ROLAP) An implementation that presents the user with a multidimensional view of data that originates from a relational database structure.

Replication Method whereby copies of databases are maintained at multiple sites in a distributed system, to improve availability and response times. Replication is frequently employed as part of a backup and recovery strategy.

Reports See Oracle Reports.

ROLAP See Relational online analytical processing.

Row A series of attributes that identify the characteristics, to be stored on the database, of a significant object, such as a person. Also referred to as tuple. See Table.

S

Schema A logical representation or model of a database structure.

Scrubbing See Cleansing.

Semiadditive A numeric fact that can be added along some dimensions in a fact table but not others.
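
A short worked example of a semiadditive fact, assuming hypothetical daily account balances (data and function names are illustrative). Balances add meaningfully across the account dimension on a single day, but not across the time dimension, where an average is the usual choice.

```python
# Hypothetical daily balance snapshots: (date, account, balance).
balances = [
    ("1999-05-01", "A", 100.0),
    ("1999-05-01", "B", 50.0),
    ("1999-05-02", "A", 120.0),
    ("1999-05-02", "B", 50.0),
]


def total_on(day):
    """Adding across accounts for one day is meaningful."""
    return sum(bal for date, _, bal in balances if date == day)


def average_over_time(account):
    """Across days, average the balance instead of summing it."""
    values = [bal for _, acct, bal in balances if acct == account]
    return sum(values) / len(values)


total_day_two = total_on("1999-05-02")
average_a = average_over_time("A")
```

Summing account A over both days would give 220, a figure that never existed as a balance; the average of 110 is the meaningful value along the time dimension.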

Server Software that handles the functions required for concurrent, shared access to a database. The server receives and processes SQL and PL/SQL statements originating from client applications. The computer that runs the Server must be optimized for its duties. The Oracle server was previously called the Relational database management system. See Relational database management system.

Slice and dice A mechanism whereby a query can analyze information along any dimension of the multidimensional model equally.

Slowly changing dimensions The tendency of dimension records, particularly the product and customer dimensions, to change gradually or occasionally over time.
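
One way to keep history when a dimension record changes, sketched under stated assumptions: each change appends a new row whose key is formed by modifying the existing key (here by adding a version suffix, in the spirit of the Generalized key entry). The glossary does not prescribe this technique; the row layout, key format, and `apply_change` helper are hypothetical, and the customer is assumed to exist already.

```python
def apply_change(dimension, customer_id, new_city):
    """Append a new version row instead of overwriting the old one."""
    versions = [row for row in dimension if row["customer_id"] == customer_id]
    current = max(versions, key=lambda row: row["version"])
    if current["city"] == new_city:
        return current["key"]  # nothing changed, keep the current row
    new_version = current["version"] + 1
    row = {
        "key": f"{customer_id}-{new_version}",  # key derived from the existing key
        "customer_id": customer_id,
        "version": new_version,
        "city": new_city,
    }
    dimension.append(row)
    return row["key"]


# Hypothetical customer dimension with one existing row.
customer_dim = [
    {"key": "C42-1", "customer_id": "C42", "version": 1, "city": "Boston"},
]

new_key = apply_change(customer_dim, "C42", "Denver")   # adds version 2
same_key = apply_change(customer_dim, "C42", "Denver")  # no change, no new row
```

Fact rows loaded before the move keep pointing at `C42-1`, so older facts still report against the customer's earlier city.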

Snapshots A copy (or dump) of the data in a database at any given point in time.

Snowflake model A normalized version of the star model, employed in data warehouse implementations. See Star model, Constellation model.

Source data The data that is used as the basis of warehouse data. It may come from a database, flat files, or magazine articles. Also called data source.

SQL*Loader An Oracle tool that enables streams of data to be loaded into files or a database.

SQL (Structured Query Language) The internationally accepted standard language for relational systems. See Data Manipulation Language, Data Definition Language.

SQL statement A complete command or statement written in the SQL language.

Staging area A file, operational data store, or series of relational database server tables that contains the data to be moved to the warehouse.

Star query Optimization technique that enables the dimensions and fact tables in the star model to be accessed efficiently, and data to be returned to the user efficiently. It ensures that the dimension data is visited first, and the fact data last and only once.

Star model A database organization in which a fact table with a composite key is joined to a number of single-level dimension tables. The model is used in data warehouse implementations. See Constellation model, Snowflake model.
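
A minimal sketch of a star model, built with Python's `sqlite3` module for illustration only. The table and column names (`time_dim`, `product_dim`, `sales_fact`) are hypothetical. The fact table's composite key is made up of the keys of the two dimension tables it joins to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT);

-- Fact table: composite key built from the dimension keys.
CREATE TABLE sales_fact (
    time_id INTEGER REFERENCES time_dim(time_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    amount REAL,
    PRIMARY KEY (time_id, product_id)
);

INSERT INTO time_dim VALUES (1, '1999-05'), (2, '1999-06');
INSERT INTO product_dim VALUES (10, 'widget'), (20, 'gadget');
INSERT INTO sales_fact VALUES (1, 10, 25.0), (1, 20, 4.0), (2, 10, 7.0);
""")

# Join the fact table to each single-level dimension table.
rows = conn.execute("""
    SELECT t.month, p.name, f.amount
    FROM sales_fact f
    JOIN time_dim t ON t.time_id = f.time_id
    JOIN product_dim p ON p.product_id = f.product_id
    WHERE p.name = 'widget'
    ORDER BY t.month
""").fetchall()
```

The query filters on a dimension attribute (`name`) and reads the measurement (`amount`) from the fact table, which is the typical access pattern against a star.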

Subject area A vertical portion of the business, such as Sales and Marketing, that is developed as an iteration of the enterprise-wide data warehouse.

Summary data Data that is aggregated and stored in a summary fact table and made available to the user for direct and easy access.

Summary table A data structure in the warehouse that contains summarized (or aggregated) facts. See Summary data.
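
A minimal sketch of building a summary table from detail facts, again using `sqlite3` for illustration. The `sales_fact` and `sales_month_summary` tables are hypothetical, and a production warehouse would typically refresh such a table as part of its load process rather than create it ad hoc.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL);
INSERT INTO sales_fact VALUES
    ('1999-05-01', 'widget', 10.0),
    ('1999-05-02', 'widget', 15.0),
    ('1999-06-01', 'widget', 7.0);

-- Summary table: facts aggregated from daily detail to monthly totals.
CREATE TABLE sales_month_summary AS
    SELECT substr(sale_date, 1, 7) AS month,
           product,
           SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7), product;
""")

summary = conn.execute(
    "SELECT month, product, total_amount FROM sales_month_summary ORDER BY month"
).fetchall()
```

Queries that need only monthly totals can read the two summary rows instead of scanning every detail row.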

Symmetric Multiprocessor (SMP) A shared-everything hardware and software architecture, where memory and disk controllers are accessible to all CPUs. See CPU.

System Global Area (SGA) A large area of memory allocated to a database instance for caching. See Cache.

T

Table A relational database structure that comprises vertical columns (attributes) and horizontal rows (tuples) of data. See Primary key, row, and column.

Terabyte One trillion bytes.

Time stamp A date and time value written to a record when it is created or changed in the database.

Transformation The process of redefining data based on predefined rules, using specific formulas and techniques. Also called data transformation. See ETT.

Transparent Gateway Middleware that enables viewing of data resident in a non-Oracle database from Oracle applications. See Middleware, Procedural Gateway.

Transportation The movement of data to the warehouse server. Also called data transportation. See ETT.

U

Uniform Resource Locator (URL) Text used to identify and address an item in a computer network.

Usage curve A line chart showing the amount of CPU used at any time during normal system activity.

User A person at any level of the organization who needs to access the data in the data warehouse for information in order to perform a business function.

User metadata The information provided to users that allows them to understand and access warehouse data. It focuses on what data is in the warehouse, how it was transformed, the source, and the timeliness of the data. See Business metadata and Transformation.

V

Very large database (VLDB) A very large database is measured in gigabytes and terabytes.

Very large memory (VLM) Computers with 64-bit memory structures.

VLDB See Very large database.

VLM See Very large memory.

W

Warehouse manager The mechanism that maintains the data in the warehouse database.

Warehouse Technology Initiative (WTI) An Oracle program that invites other vendors to offer products and services that are complementary to those offered by Oracle, particularly in the area of products and services related to data warehousing.

WTI See Warehouse Technology Initiative.