Code examples in these sections use the pubs sample database included with SQL Server 2000 to demonstrate various activities performed in ETL systems. The examples illustrate techniques for loading dimension tables in the data warehouse; they do not take into consideration separate procedures that may be required to update OLAP cubes or aggregation tables.
The use of temporary and staging tables in the ETL database allows the data extraction and loading process to be broken up into smaller segments of work that can be individually recovered. The temporary tables allow the source data to be loaded and transformed without impacting the performance of the source system except for what is necessary to extract the data. The staging tables provide a mechanism for data validation and surrogate key generation before loading transformed data into the data warehouse. Transformation, validation, and surrogate key management tasks should never be performed directly on dimension tables in the data warehouse.
The code examples in this chapter are presented as Transact-SQL, in order to communicate to the widest audience. A production ETL system would use DTS to perform this work. A very simple system may use several Execute SQL tasks linked within a package. More complex systems divide units of work into separate packages, and call those subpackages from a master package. For a detailed explanation of how to use DTS to implement the functionality described in this chapter, please see SQL Server Books Online.
Code Example 19.1
Code Example 19.2
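As a rough sketch of the kind of working tables described above (the definitions here are assumptions patterned on the pubs authors table, not the original example code), the temporary and staging tables for the Authors dimension might be created and initially populated as follows:

-- Hypothetical temporary table holding rows extracted from the source system.
CREATE TABLE Authors_Temp (
    au_id     varchar(11)  NOT NULL,  -- source system primary key
    au_lname  varchar(40)  NOT NULL,
    au_fname  varchar(20)  NOT NULL,
    contract  bit          NOT NULL
)

-- Hypothetical staging table mirroring the Authors dimension table; the
-- IDENTITY column generates the surrogate key.
CREATE TABLE Authors_Staging (
    Author_Key  int          IDENTITY(1,1) NOT NULL,
    au_id       varchar(11)  NOT NULL,  -- original key carried from the source
    au_lname    varchar(40)  NOT NULL,
    au_fname    varchar(20)  NOT NULL,
    contract    bit          NOT NULL
)

-- Extract the source rows into the temporary table.
INSERT INTO Authors_Temp (au_id, au_lname, au_fname, contract)
SELECT au_id, au_lname, au_fname, contract
FROM pubs.dbo.authors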
The following Transact-SQL code demonstrates a simple technique for loading new rows into the Authors dimension. This example assumes that the source system provides a primary key that we can use and that no change set is available.
Code Example 19.3
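A minimal sketch of this technique, using the hypothetical tables above (the Authors dimension table name is also an assumption), might look like this:

-- Stage authors that do not yet exist, identified by the source primary key.
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, contract)
SELECT t.au_id, t.au_lname, t.au_fname, t.contract
FROM Authors_Temp AS t
WHERE NOT EXISTS
    (SELECT * FROM Authors_Staging AS s WHERE s.au_id = t.au_id)

-- Load the newly staged rows, with their surrogate keys, into the dimension.
INSERT INTO Authors (Author_Key, au_id, au_lname, au_fname, contract)
SELECT s.Author_Key, s.au_id, s.au_lname, s.au_fname, s.contract
FROM Authors_Staging AS s
WHERE NOT EXISTS
    (SELECT * FROM Authors AS d WHERE d.Author_Key = s.Author_Key)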
Type 1 and Type 2 dimension changes are discussed in this section. Type 3 changes are not recommended for most data warehouse applications and are not discussed here. For more information and details about slowly changing dimensions, see Chapter 17, "Data Warehouse Design Considerations."
Type 1 and Type 2 dimension change techniques are used when dimension attributes change in records that already exist in the data warehouse. The techniques for inserting new records into dimensions (discussed earlier in the section "Inserting New Dimension Records") apply to all dimensions regardless of whether changes to dimension attributes are incorporated using Type 1 or Type 2 change techniques.
The code examples in the following sections demonstrate techniques for managing Type 1 and Type 2 dimension changes. The examples have been kept simple to maintain clarity for technique illustration purposes—the examples assume that all changes for a dimension will be of the same type, whereas, in reality, most dimensions include some attributes that require Type 2 changes and other attributes that can be maintained using Type 1 changes. For example, a retailer may decide that a change in the marital status of a customer should be treated as a Type 2 change, whereas a change of street address for the same customer should be treated as a Type 1 change. Therefore, it is important to document all of the attributes in a dimension and, for each attribute, whether a value change should be applied as Type 1 or a Type 2 change.
Type 1 dimension changes are straightforward to implement. The following Transact-SQL code demonstrates a simple Type 1 technique for updating existing rows in the Authors dimension. For this example, we will change some data in the Authors_Temp table to simulate changed records received as a result of updates to the authors table in the source database. The value for the Contract field is assumed to be eligible for Type 1 changes in this example. In a later section, the Contract field will be updated using a Type 2 change. The following example assumes that the source system provides a primary key that we can use and that no change set is available:
Code Example 19.4
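Under the same assumptions, a Type 1 change can be sketched as an in-place UPDATE that simply overwrites the changed attribute:

-- Simulate a source system change by flipping the contract flag for one author.
UPDATE Authors_Temp
SET contract = 0
WHERE au_id = '172-32-1176'

-- Type 1 change: overwrite the attribute in the staging table...
UPDATE s
SET s.contract = t.contract
FROM Authors_Staging AS s
INNER JOIN Authors_Temp AS t ON t.au_id = s.au_id
WHERE s.contract <> t.contract

-- ...and apply the same overwrite to the dimension table itself.
UPDATE d
SET d.contract = s.contract
FROM Authors AS d
INNER JOIN Authors_Staging AS s ON s.Author_Key = d.Author_Key
WHERE d.contract <> s.contract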
Although it is relatively straightforward to implement Type 2 change techniques in the ETL process to manage slowly changing dimensions, the data associated with a dimension member becomes fragmented as such changes are made. Data warehouse analysis and reporting tools must be capable of summarizing data correctly for dimensions that include Type 2 changes. To minimize unnecessary fragmentation, a Type 2 change should not be used if a Type 1 change is appropriate.
The techniques used to insert new records into a Type 2 dimension are the same as the ones used to insert new records into a Type 1 dimension. However, the techniques used to track updates to dimension records are different.
The following Transact-SQL code demonstrates a simple technique for applying Type 2 changes to existing rows in the Authors dimension. Unlike a Type 1 change, existing records are not updated in a Type 2 dimension. Instead, new records are added to the dimension to contain the changes to the source system records. In this example, the values of the Contract field that were changed in the Type 1 example are changed again to different values, and the Contract field is now assumed to be managed as a Type 2 change.
Notice that the Transact-SQL statement used to load updated records into the staging table is the same as the one used to insert new records into the staging table, except that the predicate clauses in the two statements differ. When loading new records, the WHERE clause uses the auth_id field to determine which records are new. When inserting records for Type 2 changes, the WHERE clause causes new records to be added when the value of the attribute of interest (Contract) in the temporary table differs from the attribute value in the staging table.
Code Example 19.5
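With the same hypothetical tables, the Type 2 technique inserts a new row, and therefore a new surrogate key, whenever the tracked attribute differs, instead of updating the existing row:

-- Type 2 change: add a new staging row for each author whose contract value
-- differs from the value already staged. (A production system would compare
-- against the current row for each author only; see the current-version
-- technique later in this chapter.)
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, contract)
SELECT t.au_id, t.au_lname, t.au_fname, t.contract
FROM Authors_Temp AS t
INNER JOIN Authors_Staging AS s ON s.au_id = t.au_id
WHERE s.contract <> t.contract

-- The new rows are then loaded into the dimension exactly as new records are.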
For purposes of these examples, a table (Fact_Source) is created that simulates a data table in a source system from which fact data can be extracted. The Fact_Source table data is a combination of data found in the Sales and TitleAuthor tables in the pubs database.
The following table lists definitions of the tables created for use with the examples that follow.
The following Transact-SQL statements create the tables defined above:
Code Example 19.6
Code Example 19.7
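The column definitions below are assumptions based on the pubs Sales table and on the description of the staging table that follows; they are intended only to make the subsequent sketches concrete:

-- Hypothetical simulated source table (a combination of Sales and TitleAuthor data).
CREATE TABLE Fact_Source (
    stor_id   char(4)      NOT NULL,
    ord_num   varchar(20)  NOT NULL,
    ord_date  datetime     NOT NULL,
    title_id  varchar(6)   NOT NULL,
    qty       smallint     NOT NULL
)

-- Hypothetical fact staging table. The surrogate key columns default to zero
-- so that NULLs never appear, even before the dimension keys are resolved.
CREATE TABLE Fact_Staging (
    Store_Key  int          NOT NULL DEFAULT 0,
    Title_Key  int          NOT NULL DEFAULT 0,
    stor_id    char(4)      NOT NULL,
    title_id   varchar(6)   NOT NULL,
    ord_date   datetime     NOT NULL,
    qty        smallint     NOT NULL
)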
The following code loads the Fact_Staging table. Notice that the Store_Key and Title_Key columns that are used for surrogate keys contain zeros when the data is first loaded into the staging table. This is because NULLs are not allowed in these columns. The prevention of NULLs allows for a very clean data load and eliminates the need for NULL logic checks in the ETL code or the reporting system. The zeros in the columns also provide an easy mechanism for locating invalid data in the dimension and fact table data. If a zero appears in either column in the final fact table, then the ETL logic failed to handle a dimension attribute. It is good practice to always add a dimension record with a key of zero and assign it the description of "unknown." This helps preserve relational integrity in the data warehouse and allows reporting systems to display the invalid data, so that corrections can be made to the ETL logic or the source data.
Code Example 19.8
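Assuming the hypothetical tables sketched above, the load just described amounts to the following:

-- Load the fact staging table; the surrogate key columns are explicitly set to
-- zero until they are resolved against the dimension tables. Any row still
-- carrying a zero key later points to data the ETL logic failed to handle.
INSERT INTO Fact_Staging (Store_Key, Title_Key, stor_id, title_id, ord_date, qty)
SELECT 0, 0, fs.stor_id, fs.title_id, fs.ord_date, fs.qty
FROM Fact_Source AS fs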
Code Example 19.9
Code Example 19.10
Code Example 19.11
Code Example 19.12
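The surrogate key resolution implied by the preceding discussion can be sketched as a pair of UPDATE statements that look up each key from its dimension staging table (the table and column names are assumptions):

-- Resolve the store surrogate key from the Stores staging table.
UPDATE f
SET f.Store_Key = s.Store_Key
FROM Fact_Staging AS f
INNER JOIN Stores_Staging AS s ON s.stor_id = f.stor_id

-- Resolve the title surrogate key from the Titles staging table.
UPDATE f
SET f.Title_Key = t.Title_Key
FROM Fact_Staging AS f
INNER JOIN Titles_Staging AS t ON t.title_id = f.title_id

-- With a Type 2 dimension, the join must select the row that is current for
-- each store or title; the current-version technique below addresses this.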
One of the key design decisions in the above techniques centers on the use of the staging tables. In the techniques illustrated above, the staging tables are exact copies of the final data warehouse dimension tables. However, the efficiency of the above techniques decreases as the number of rows in the dimension increases due to records added for Type 2 changes. For very large dimensions (millions of rows), the above technique will require massive amounts of processing power to complete. Therefore, for large dimensions, we need to introduce a variation of the above technique that will allow the system to scale with the data warehouse.
This variation involves creating a "current version" dimension table for use in the ETL process that contains only a single row for each of the dimension members. This record contains the current attributes of the dimension member. For example, if we have a Type 2 dimension for stores, and the data for the store Bookbeat has undergone three Type 2 changes, then the current version table would not contain all four records for the store. Instead, the table contains a single row for Bookbeat that contains all of the current information for it, including the current surrogate key value for the dimension member. This creates a smaller table with fewer rows that allows for faster access during the ETL process.
The following code incorporates a Store_Current table to demonstrate this technique for the Stores dimension. The table below describes each of the tables used in the example.
Code Example 19.14
Code Example 19.15
Code Example 19.16
Code Example 19.17
Code Example 19.18
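A sketch of the current-version idea, again with assumed table and column names, might look like this:

-- One row per store: its current attributes and its current surrogate key.
CREATE TABLE Stores_Current (
    Store_Key  int          NOT NULL,
    stor_id    char(4)      NOT NULL PRIMARY KEY,
    stor_name  varchar(40)  NOT NULL
)

-- When a Type 2 change adds a new dimension row, overwrite the current-version
-- row with the new surrogate key and attribute values. (This sketch assumes the
-- newest row for a store carries the highest surrogate key.)
UPDATE c
SET c.Store_Key = n.Store_Key,
    c.stor_name = n.stor_name
FROM Stores_Current AS c
INNER JOIN Stores_Staging AS n ON n.stor_id = c.stor_id
WHERE n.Store_Key > c.Store_Key

-- Fact rows resolve their surrogate keys against the small current-version
-- table instead of the full, historically fragmented dimension.
UPDATE f
SET f.Store_Key = c.Store_Key
FROM Fact_Staging AS f
INNER JOIN Stores_Current AS c ON c.stor_id = f.stor_id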
In the code examples that follow, the terms "Job" and "Step" are used with the following meanings:
Designing meta data storage requires careful planning and implementation. There are dependencies between tables and order of precedence constraints on records. However, the meta data information generated by ETL activities is critical to the success of the data warehouse. Following is a sample set of tables that can be used to track meta data for ETL activities.
tblAdmin_Job_Master
This table lists all of the jobs that are used to populate the data warehouse. These are the fields in tblAdmin_Job_Master:
tblAdmin_Step_Master
This table lists all of the steps in a job. These are the fields in tblAdmin_Step_Master:
Note: In SQL Server 2000, only the error number can be trapped, not the generated error message.
tblAdmin_Audit_Errors
This table lists all of the errors that are generated during a job step. These are the fields in tblAdmin_Audit_Errors:
Code Example 19.21
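Purely as an illustration of how such a table might be used together with the note above (the column names here are hypothetical, not the original field list), trapping and recording an error number could look like this:

-- Hypothetical error audit table.
CREATE TABLE tblAdmin_Audit_Errors (
    ErrorID        int       IDENTITY(1,1) NOT NULL,
    JobID          int       NOT NULL,
    StepID         int       NOT NULL,
    ErrorNumber    int       NOT NULL,  -- only the number is available, not the message
    ErrorDateTime  datetime  NOT NULL DEFAULT GETDATE()
)

-- Capture @@ERROR immediately after a step's statement and record it if nonzero.
DECLARE @Error int
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, contract)
SELECT t.au_id, t.au_lname, t.au_fname, t.contract
FROM Authors_Temp AS t
WHERE NOT EXISTS (SELECT * FROM Authors_Staging AS s WHERE s.au_id = t.au_id)
SET @Error = @@ERROR
IF @Error <> 0
    INSERT INTO tblAdmin_Audit_Errors (JobID, StepID, ErrorNumber)
    VALUES (1, 1, @Error)  -- the job and step identifiers are placeholders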
Therefore, storing the threshold with the step (a Type 2 dimension) allows us to maintain historical execution records without affecting their integrity if the master step information is changed. For example, if a step initially loads 1,000 rows but over time the number of rows increases to 1 million, the acceptable threshold information for that step must be changed as well. If the threshold data is stored only in the tblAdmin_Step_Master table and not stored with each record, the context of the data will be lost, which can cause inaccuracies in reports built on the meta data information. For simplicity, to illustrate the technique, the sample code does not maintain threshold information automatically. In order to change the threshold information for a step, an administrator will need to modify the master step record manually. However, it would be possible to automate this process.
Code Example 19.22
Code Example 19.23
Code Example 19.24
Code Example 19.25
Code Example 19.26
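A sketch of the threshold comparison described above might read as follows; the RowThreshold column and the identifiers are assumptions, not the original schema:

-- Compare the rows processed by a step against the threshold recorded for it.
DECLARE @StepID int, @RowsProcessed int, @Threshold int
SET @StepID = 1

INSERT INTO Authors_Staging (au_id, au_lname, au_fname, contract)
SELECT t.au_id, t.au_lname, t.au_fname, t.contract
FROM Authors_Temp AS t
WHERE NOT EXISTS (SELECT * FROM Authors_Staging AS s WHERE s.au_id = t.au_id)
SET @RowsProcessed = @@ROWCOUNT

SELECT @Threshold = RowThreshold
FROM tblAdmin_Step_Master
WHERE StepID = @StepID

IF @RowsProcessed > @Threshold
    PRINT 'Row count exceeded the threshold recorded for this step.'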
The ETL system efficiently extracts data from its sources, transforms and sometimes aggregates data to match the target data warehouse schema, and loads the transformed data into the data warehouse database. A well-designed ETL system supports automated operation that informs operators of errors with the appropriate level of warning. SQL Server 2000 Data Transformation Services can be used to manage the ETL operations, regardless of the techniques used to implement individual ETL tasks.
While it is tempting to perform some transformation on data as it is extracted from the source system, the best practice is to isolate transformations within the transformation modules. In general, the data extraction code should be designed to minimize the impact on the source system databases.
In most applications, the key to efficient transformation is to use a SQL Server 2000 database for staging. Once extracted data has been loaded into a staging database, the powerful SQL Server 2000 database engine is used to perform complex transformations.
The process of loading fact table data from the staging area into the target data warehouse should use bulk load techniques. Dimension table data is usually small in volume, which makes bulk loading less important for dimension table loading.
The ETL system is a primary source of meta data that can be used to track information about the operation and performance of the data warehouse as well as the ETL processes.
Data Extraction, Transformation, and Loading Techniques
Introduction
During the ETL process, data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database. Many data warehouses also incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.
In its simplest form, ETL is the process of copying data from one database to another. This simplicity is rarely, if ever, found in data warehouse implementations; in reality, ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development efforts and requires the skills of business analysts, database designers, and application developers.
When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation. ETL systems vary from data warehouse to data warehouse and even between department data marts within a data warehouse. A monolithic application, regardless of whether it is implemented in Transact-SQL or a traditional programming language, does not provide the flexibility for change necessary in ETL systems. A mixture of tools and technologies should be used to develop applications that each perform a specific ETL task.
The ETL process is not a one-time event; new data is added to a data warehouse periodically. Typical periodicity may be monthly, weekly, daily, or even hourly, depending on the purpose of the data warehouse and the type of business it serves. Because ETL is an integral, ongoing, and recurring part of a data warehouse, ETL processes must be automated and operational procedures documented. ETL also changes and evolves as the data warehouse evolves, so ETL processes must be designed for ease of modification. A solid, well-designed, and documented ETL system is necessary for the success of a data warehouse project.
Data warehouses evolve to improve their service to the business and to adapt to changes in business processes and requirements. Business rules change as the business reacts to market influences—the data warehouse must respond in order to maintain its value as a tool for decision makers. The ETL implementation must adapt as the data warehouse evolves.
Microsoft® SQL Server™ 2000 provides significant enhancements to existing performance and capabilities, and introduces new features that make ETL processes easier to develop, deploy, and maintain, and faster to execute.
ETL Functional Elements
Regardless of how they are implemented, all ETL systems have a common purpose: they move data from one database to another. Generally, ETL systems move data from OLTP systems to a data warehouse, but they can also be used to move data from one data warehouse to another. An ETL system consists of four distinct functional elements:
• Extraction
• Transformation
• Loading
• Meta data
Extraction
The ETL extraction element is responsible for extracting data from the source system. During extraction, data may be removed from the source system or a copy made and the original data retained in the source system. It is common to move historical data that accumulates in an operational OLTP system to a data warehouse to maintain OLTP performance and efficiency. Legacy systems may require too much effort to implement such offload processes, so legacy data is often copied into the data warehouse, leaving the original data in place. Extracted data is loaded into the data warehouse staging area (a relational database usually separate from the data warehouse database) for manipulation by the remaining ETL processes.
Data extraction is generally performed within the source system itself, especially if it is a relational database to which extraction procedures can easily be added. It is also possible for the extraction logic to exist in the data warehouse staging area and query the source system for data using ODBC, OLE DB, or other APIs. For legacy systems, the most common method of data extraction is for the legacy system to produce text files, although many newer systems offer direct query APIs or accommodate access through ODBC or OLE DB.
Data extraction processes can be implemented using Transact-SQL stored procedures, Data Transformation Services (DTS) tasks, or custom applications developed in programming or scripting languages.
Transformation
The ETL transformation element is responsible for data validation, data accuracy, data type conversion, and business rule application. It is the most complicated of the ETL elements. It may appear to be more efficient to perform some transformations as the data is being extracted (inline transformation); however, an ETL system that uses inline transformations during extraction is less robust and flexible than one that confines transformations to the transformation element. Transformations performed in the OLTP system impose a performance burden on the OLTP database. They also split the transformation logic between two ETL elements and add maintenance complexity when the ETL logic changes.

Tools used in the transformation element vary. Some data validation and data accuracy checking can be accomplished with straightforward Transact-SQL code. More complicated transformations can be implemented using DTS packages. The application of complex business rules often requires the development of sophisticated custom applications in various programming languages. You can use DTS packages to encapsulate multi-step transformations into a single task.
Listed below are some basic examples that illustrate the types of transformations performed by this element; a brief Transact-SQL sketch of two of them follows the list:
Data Validation
Check that all rows in the fact table match rows in dimension tables to enforce data integrity.
Data Accuracy
Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.
Data Type Conversion
Ensure that all values for a specified field are stored the same way in the data warehouse regardless of how they were stored in the source system. For example, if one source system stores "off" or "on" in its status field and another source system stores "0" or "1" in its status field, then a data type conversion transformation converts the content of one or both of the fields to a specified common value such as "off" or "on".
Business Rule Application
Ensure that the rules of the business are enforced on the data stored in the warehouse. For example, check that all customer records contain values for both FirstName and LastName fields.
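For instance, the integrity check and the status conversion above might be written like this (the table and column names are illustrative only):

-- Data validation: find fact rows that have no matching dimension row.
SELECT f.customer_id
FROM Sales_Fact_Staging AS f
LEFT OUTER JOIN Customer_Dim AS d ON d.customer_id = f.customer_id
WHERE d.customer_id IS NULL

-- Data type conversion: map a source system's '0'/'1' codes to the common
-- 'off'/'on' representation used in the warehouse.
UPDATE src
SET src.status = CASE src.status WHEN '1' THEN 'on'
                                 WHEN '0' THEN 'off'
                                 ELSE src.status END
FROM Status_Temp AS src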
Loading
The ETL loading element is responsible for loading transformed data into the data warehouse database. Data warehouses are usually updated periodically rather than continuously, and large numbers of records are often loaded to multiple tables in a single data load. The data warehouse is often taken offline during update operations so that data can be loaded faster and SQL Server 2000 Analysis Services can update OLAP cubes to incorporate the new data. BULK INSERT, bcp, and the Bulk Copy API are the best tools for data loading operations. The design of the loading element should focus on efficiency and performance to minimize the data warehouse offline time. For more information and details about performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing."
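For example, a delimited file produced by an earlier ETL step might be loaded with BULK INSERT; the file path, table name, and options shown here are illustrative rather than prescriptive:

-- Bulk load a delimited text file into a fact table.
BULK INSERT dbo.Sales_Fact
FROM 'C:\etl\staging\sales_fact.txt'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    TABLOCK,           -- a table lock helps achieve a minimally logged, faster load
    BATCHSIZE = 50000
)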
Meta Data

The ETL meta data functional element is responsible for maintaining information (meta data) about the movement and transformation of data, and the operation of the data warehouse. It also documents the data mappings used during the transformations. Meta data logging provides possibilities for automated administration, trend prediction, and code reuse.

Examples of data warehouse meta data that can be recorded and used to analyze the activity and performance of a data warehouse include:
• Data Lineage, such as the time that a particular set of records was loaded into the data warehouse.
• Schema Changes, such as changes to table definitions.
• Data Type Usage, such as identifying all tables that use the "Birthdate" user-defined data type.
• Transformation Statistics, such as the execution time of each stage of a transformation, the number of rows processed by the transformation, the last time the transformation was executed, and so on.
• DTS Package Versioning, which can be used to view, branch, or retrieve any historical version of a particular DTS package.
• Data Warehouse Usage Statistics, such as query times for reports.
ETL Design Considerations
Regardless of their implementation, a number of design considerations are common to all ETL systems:
Modularity
ETL systems should contain modular elements that perform discrete tasks. This encourages reuse and makes them easy to modify when implementing changes in response to business and data warehouse changes. Monolithic systems should be avoided.
Consistency
ETL systems should guarantee consistency of data when it is loaded into the data warehouse. An entire data load should be treated as a single logical transaction—either the entire data load is successful or the entire load is rolled back. In some systems, the load is a single physical transaction, whereas in others it is a series of transactions. Regardless of the physical implementation, the data load should be treated as a single logical transaction; a brief Transact-SQL sketch of this idea follows the list of considerations.
Flexibility
ETL systems should be developed to meet the needs of the data warehouse and to accommodate the source data environments. It may be appropriate to accomplish some transformations in text files and some on the source data system; others may require the development of custom applications. A variety of technologies and techniques can be applied, using the tool most appropriate to the individual task of each ETL functional element.
Speed
ETL systems should be as fast as possible. Ultimately, the time window available for ETL processing is governed by data warehouse and source system schedules. Some data warehouse elements may have a huge processing window (days), while others may have a very limited processing window (hours). Regardless of the time available, it is important that the ETL system execute as rapidly as possible.
Heterogeneity
ETL systems should be able to work with a wide variety of data in different formats. An ETL system that only works with a single type of source data is useless.
Meta Data Management
ETL systems are arguably the single most important source of meta data about both the data in the data warehouse and data in the source system. Finally, the ETL process itself generates useful meta data that should be retained and analyzed regularly. Meta data is discussed in greater detail later in this chapter.
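The Consistency consideration mentioned above can be illustrated by wrapping a load in an explicit transaction and rolling back on any error; this is a simplified sketch with illustrative table names, and a real load typically spans many statements:

BEGIN TRANSACTION

INSERT INTO Sales_Fact (Store_Key, Title_Key, ord_date, qty)
SELECT Store_Key, Title_Key, ord_date, qty
FROM Fact_Staging
IF @@ERROR <> 0 GOTO LoadFailed

COMMIT TRANSACTION
RETURN

LoadFailed:
ROLLBACK TRANSACTION
RAISERROR ('Data warehouse load failed; all changes were rolled back.', 16, 1)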
ETL Architectures
Before discussing the physical implementation of ETL systems, it is important to understand the different ETL architectures and how they relate to each other. Essentially, ETL systems can be classified into two architectures: the homogeneous architecture and the heterogeneous architecture.
Homogeneous Architecture
A homogeneous architecture for an ETL system is one that involves only a single source system and a single target system. Data flows from the single source of data through the ETL processes and is loaded into the data warehouse, as shown in the following diagram.

Most homogeneous ETL architectures have the following characteristics:
• Single data source: Data is extracted from a single source system, such as an OLTP system.
• Rapid development: The development effort required to extract the data is straightforward because there is only one data format for each record type.
• Light data transformation: No data transformations are required to achieve consistency among disparate data formats, and the incoming data is often in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs and other formatting transformations.
• Light structural transformation: Because the data comes from a single source, the amount of structural changes such as table alteration is also very light. The structural changes typically involve denormalization efforts to meet data warehouse schema requirements.
• Simple research requirements: The research efforts to locate data are generally simple: if the data is in the source system, it can be used. If it is not, it cannot.

The homogeneous ETL architecture is generally applicable to data marts, especially those focused on a single subject matter.
Heterogeneous Architecture
A heterogeneous architecture for an ETL system is one that extracts data from multiple sources, as shown in the following diagram. The complexity of this architecture arises from the fact that data from more than one source must be merged, rather than from the fact that data may be formatted differently in the different sources. However, significantly different storage formats and database schemas do provide additional complications.

Most heterogeneous ETL architectures have the following characteristics:
• Multiple data sources.
• More complex development: The development effort required to extract the data is increased because there are multiple source data formats for each record type.
• Significant data transformation: Data transformations are required to achieve consistency among disparate data formats, and the incoming data is often not in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs, additional data formatting, data conversions, lookups, computations, and referential integrity verification. Precomputed calculations may require combining data from multiple sources, or data that has multiple degrees of granularity, such as allocating shipping costs to individual line items.
• Significant structural transformation: Because the data comes from multiple sources, the amount of structural changes, such as table alteration, is significant.
• Substantial research requirements to identify and match data elements.

Heterogeneous ETL architectures are found more often in data warehouses than in data marts.
ETL Development
ETL development consists of two general phases: identifying and mapping data, and developing functional element implementations. Both phases should be carefully documented and stored in a central, easily accessible location, preferably in electronic form.
Identify and Map Data
This phase of the development process identifies sources of data elements, the targets for those data elements in the data warehouse, and the transformations that must be applied to each data element as it is migrated from its source to its destination. High level data maps should be developed during the requirements gathering and data modeling phases of the data warehouse project. During the ETL system design and development process, these high level data maps are extended to thoroughly specify system details.

Identify Source Data
For some systems, identifying the source data may be as simple as identifying the server where the data is stored in an OLTP database and the storage type (SQL Server database, Microsoft Excel spreadsheet, or text file, among others). In other systems, identifying the source may mean preparing a detailed definition of the meaning of the data, such as a business rule, a definition of the data itself, such as decoding rules (O = On, for example), or even detailed documentation of a source system for which the system documentation has been lost or is not current.

Identify Target Data
Each data element is destined for a target in the data warehouse. A target for a data element may be an attribute in a dimension table, a numeric measure in a fact table, or a summarized total in an aggregation table. There may not be a one-to-one correspondence between a source data element and a data element in the data warehouse because the destination system may not contain the data at the same granularity as the source system. For example, a retail client may decide to roll data up to the SKU level by day rather than track individual line item data. The level of item detail that is stored in the fact table of the data warehouse is called the grain of the data. If the grain of the target does not match the grain of the source, the data must be summarized as it moves from the source to the target.

Map Source Data to Target Data
A data map defines the source fields of the data, the destination fields in the data warehouse, and any data modifications that need to be accomplished to transform the data into the desired format for the data warehouse. Some transformations require aggregating the source data to a coarser granularity, such as summarizing individual item sales into daily sales by SKU. Other transformations involve altering the source data itself as it moves from the source to the target. Some transformations decode data into human readable form, such as replacing "1" with "on" and "0" with "off" in a status field. If two source systems encode data destined for the same target differently (for example, a second source system uses Yes and No for status), a separate transformation for each source system must be defined. Transformations must be documented and maintained in the data maps. The relationship between the source and target systems is maintained in a map that is referenced to execute the transformation of the data before it is loaded in the data warehouse.

Develop Functional Elements
Design and implementation of the four ETL functional elements, Extraction, Transformation, Loading, and meta data logging, vary from system to system. There will often be multiple versions of each functional element.

Each functional element contains steps that perform individual tasks, which may execute on one of several systems, such as the OLTP or legacy systems that contain the source data, the staging area database, or the data warehouse database. Various tools and techniques may be used to implement the steps in a single functional area, such as Transact-SQL, DTS packages, or custom applications developed in a programming language such as Microsoft Visual Basic®. Steps that are discrete in one functional element may be combined in another.
Extraction
The extraction element may have one version to extract data from one OLTP data source, a different version for a different OLTP data source, and multiple versions for legacy systems and other sources of data. This element may include tasks that execute SELECT queries from the ETL staging database against a source OLTP system, or it may execute some tasks on the source system directly and others in the staging database, as in the case of generating a flat file from a legacy system and then importing it into tables in the ETL database. Regardless of methods or number of steps, the extraction element is responsible for extracting the required data from the source system and making it available for processing by the next element.

Transformation
Frequently a number of different transformations, implemented with various tools or techniques, are required to prepare data for loading into the data warehouse. Some transformations may be performed as data is extracted, such as an application on a legacy system that collects data from various internal files as it produces a text file of data to be further transformed. However, transformations are best accomplished in the ETL staging database, where data from several data sources may require varying transformations specific to the incoming data organization and format.

Data from a single data source usually requires different transformations for different portions of the incoming data. Fact table data transformations may include summarization, and will always require surrogate dimension keys to be added to the fact records. Data destined for dimension tables in the data warehouse may require one process to accomplish one type of update to a changing dimension and a different process for another type of update.
Transformations may be implemented using Transact-SQL, as is demonstrated in the code examples later in this chapter, DTS packages, or custom applications.
Regardless of the number and variety of transformations and their implementations, the transformation element is responsible for preparing data for loading into the data warehouse.
Loading
The loading element typically has the least variety of task implementations. After the data from the various data sources has been extracted, transformed, and combined, the loading operation consists of inserting records into the various data warehouse database dimension and fact tables. Implementation may vary in the loading tasks, such as using BULK INSERT, bcp, or the Bulk Copy API. The loading element is responsible for loading data into the data warehouse database tables.

Meta Data Logging
Meta data is collected from a number of the ETL operations. The meta data logging implementation for a particular ETL task will depend on how the task is implemented. For a task implemented by using a custom application, the application code may produce the meta data. For tasks implemented by using Transact-SQL, meta data can be captured with Transact-SQL statements in the task processes. The meta data logging element is responsible for capturing and recording meta data that documents the operation of the ETL functional areas and tasks, which includes identification of data that moves through the ETL system as well as the efficiency of ETL tasks.

Common Tasks
Each ETL functional element should contain tasks that perform the following functions, in addition to tasks specific to the functional area itself:

Confirm Success or Failure. A confirmation should be generated on the success or failure of the execution of the ETL processes. Ideally, this mechanism should exist for each task so that rollback mechanisms can be implemented to allow for incremental responses to errors.
Scheduling. ETL tasks should include the ability to be scheduled for execution. Scheduling mechanisms reduce repetitive manual operations and allow for maximum use of system resources during recurring periods of low activity.
SQL Server 2000 ETL Components
SQL Server 2000 includes several components that aid in the development and maintenance of ETL systems:
• Data Transformation Services (DTS): SQL Server 2000 DTS is a set of graphical tools and programmable objects that lets you extract, transform, and consolidate data from disparate sources into single or multiple destinations.
• SQL Server Agent: SQL Server Agent provides features that support the scheduling of periodic activities on SQL Server 2000, or the notification to system administrators of problems that have occurred with the server.
• Stored Procedures and Views: Stored procedures assist in achieving a consistent implementation of logic across applications. The Transact-SQL statements and logic needed to perform a commonly performed task can be designed, coded, and tested once in a stored procedure. A view can be thought of as either a virtual table or a stored query. The data accessible through a view is not stored in the database as a distinct object; only the SELECT statement for the view is stored in the database.
• Transact-SQL: Transact-SQL is a superset of the SQL standard that provides powerful programming capabilities that include loops, variables, and other programming constructs.
• OLE DB: OLE DB is a low-level interface to data. It is an open specification designed to build on the success of ODBC by providing an open standard for accessing all kinds of data.
• Meta Data Services: SQL Server 2000 Meta Data Services provides a way to store and manage meta data about information systems and applications. This technology serves as a hub for data and component definitions, development and deployment models, reusable software components, and data warehousing descriptions.
The ETL Staging Database
In general, ETL operations should be performed on a relational database server separate from the source databases and the data warehouse database. A separate staging area database server creates a logical and physical separation between the source systems and the data warehouse, and minimizes the impact of the intense periodic ETL activity on source and data warehouse databases. If a separate database server is not available, a separate database on the data warehouse database server can be used for the ETL staging area. However, in this case it is essential to schedule periods of high ETL activity during times of low data warehouse user activity.
For small data warehouses with available excess performance and low user activity, it is possible to incorporate the ETL system into the data warehouse database. The advantage of this approach is that separate copies of data warehouse tables are not needed in the staging area. However, there is always some risk associated with performing transformations on live data, and ETL activities must be very carefully coordinated with data warehouse periods of minimum activity. When ETL is integrated into the data warehouse database, it is recommended that the data warehouse be taken offline when performing ETL transformations and loading.
Most systems can effectively stage data in a SQL Server 2000 database, as we describe in this chapter. An ETL system that needs to process extremely large volumes of data will need to use specialized tools and custom applications that operate on files rather than database tables. With extremely large volumes of data, it is not practical to load data into a staging database until it has been cleaned, aggregated, and stripped of meaningless information. Because it is much easier to build an ETL system using the standard tools and techniques that are described in this chapter, most experienced system designers will attempt to use a staging database, and move to custom tools only if data cannot be processed during the load window.
What does "extremely large" mean and when does it become infeasible to use standard DTS tasks and Transact-SQL scripts to process data from a staging database? The answer depends on the load window, the complexity of transformations, and the degree of data aggregation necessary to create the rows that are permanently stored in the data warehouse. As a conservative rule of thumb, if the transformation application needs to process more than 1 gigabyte of data in less than an hour, it may be necessary to consider specialized high performance techniques, which are outside the scope of this chapter.
This section provides general information about configuring the SQL Server 2000 database server and the database to support an ETL system staging area database with effective performance. ETL systems can vary greatly in their database server requirements; server configurations and performance option settings may differ significantly from one ETL system to another.
ETL data manipulation activities are similar in design and functionality to those of OLTP systems although ETL systems do not experience the constant activity associated with OLTP systems. Instead of constant activity, ETL systems have periods of high write activity followed by periods of little or no activity. Configuring a server and database to meet the needs of an ETL system is not as straightforward as configuring a server and database for an OLTP system.
For a detailed discussion of RAID and SQL Server 2000 performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing."
Server Configuration
Disk storage system performance is one of the most critical factors in the performance of database systems. Server configuration options offer additional methods for adjusting server performance.

RAID
As with any OLTP system, the RAID level for the disk drives on the server can make a considerable performance difference. For maximum performance of an ETL database, the disk drives for the server computer should be configured with RAID 1 or RAID 10. Additionally, it is recommended that the transaction logs, databases, and tempdb be placed on separate physical drives. Finally, if the hardware controller supports write caching, it is recommended that write caching be enabled. However, be sure to use a caching controller that guarantees that the controller cache contents will be written to disk in case of a system failure.

Server Configuration Options (sp_configure)
No specific changes need to be made to the server configuration options in order to optimize performance for an ETL system. It is recommended that these options be left at their default settings unless there is a specific reason to modify them.

Database Configuration
Database Configuration

In SQL Server 2000, database performance can be tuned by proper selection of settings for data file growth and by adjusting database configuration options.

Data File Growth
When creating a database, an initial size must be specified for the data files and the transaction log. By default, SQL Server 2000 allows the data files to grow as much as necessary until disk space is exhausted. It is important to size the database appropriately before loading any data into it: if the data files are too small, SQL Server will be forced to grow them frequently, and auto-growing a data file is an I/O intensive operation that degrades the performance of the ETL processes. If a data file is allowed to grow automatically, the growth increment may be specified as a fixed value in megabytes (MB) or kilobytes (KB), or as a percentage of the file size at the time the increment occurs. If the data file is too small, the growth increments will be frequent. For example, if a data file is initially created at 10 MB and set to grow in 10 percent increments until it reaches 20 MB, SQL Server 2000 will perform eight auto-grow operations as the data file grows to 20 MB. Therefore, it is recommended that a fixed MB value be chosen for data file growth increments.
Finally, if the server uses SCSI disks, take care to prevent disk space consumption from exceeding 85 percent of the capacity of the drive; beyond 85 percent consumption, SCSI disk performance begins to degrade. Therefore, it is recommended that the data files for the database be set to grow automatically, but only to a predefined maximum size that is no more than 85 percent of the drive's capacity.
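As a sketch of these sizing recommendations, the following statement creates a hypothetical staging database with pre-sized files, fixed MB growth increments, and a capped maximum size. The name ETL_Staging, the drive letters, and the sizes are illustrative assumptions, not values from this chapter.

--A minimal sketch; derive actual sizes from expected data volumes and drive capacity
CREATE DATABASE ETL_Staging
ON PRIMARY
  ( NAME = ETL_Staging_Data,
    FILENAME = 'E:\MSSQL\Data\ETL_Staging_Data.mdf',
    SIZE = 2000MB,        --pre-size the data file to avoid auto-grow during loads
    FILEGROWTH = 100MB,   --fixed MB increment rather than a percentage
    MAXSIZE = 8000MB )    --cap growth below roughly 85 percent of the drive
LOG ON
  ( NAME = ETL_Staging_Log,
    FILENAME = 'F:\MSSQL\Log\ETL_Staging_Log.ldf',
    SIZE = 500MB,
    FILEGROWTH = 50MB,
    MAXSIZE = 2000MB )
GO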
Database Configuration Options
Several database options can be adjusted to enhance the performance of an ETL database. For a complete discussion of these options, see SQL Server Books Online. For more information about database performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing." The following table lists some database options and the settings that may be used to increase ETL performance.
Option name: Setting
AUTO_CREATE_STATISTICS: Off
AUTO_UPDATE_STATISTICS: On
AUTO_SHRINK: Off
CURSOR_DEFAULT: LOCAL
RECOVERY: BULK_LOGGED
TORN_PAGE_DETECTION: On

Caution: Different recovery model options introduce varying degrees of risk of data loss. It is imperative that the risks be thoroughly understood before choosing a recovery model.
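As a sketch, these settings could be applied to a staging database (here the hypothetical ETL_Staging database, not part of the chapter's examples) with ALTER DATABASE; review the caution above before changing the recovery model.

--Apply the option settings listed in the table above
ALTER DATABASE ETL_Staging SET AUTO_CREATE_STATISTICS OFF
ALTER DATABASE ETL_Staging SET AUTO_UPDATE_STATISTICS ON
ALTER DATABASE ETL_Staging SET AUTO_SHRINK OFF
ALTER DATABASE ETL_Staging SET CURSOR_DEFAULT LOCAL
ALTER DATABASE ETL_Staging SET RECOVERY BULK_LOGGED
ALTER DATABASE ETL_Staging SET TORN_PAGE_DETECTION ON
GO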
Managing Surrogate Keys
Surrogate keys are critical to successful data warehouse design: they provide the means to maintain data warehouse information when dimensions change. For more information and details about surrogate keys, see Chapter 17, "Data Warehouse Design Considerations."
The following are some common characteristics of surrogate keys:

• Used as the primary key for each dimension table, instead of the original key used in the source data system. The original key for each record is carried in the table but is not used as the primary key.
• May be defined as the primary key for the fact table. In general, the fact table uses a composite primary key composed of the dimension foreign key columns, with no surrogate key. In schemas with many dimensions, load and query performance will improve substantially if a surrogate key is used. If the fact table is defined with a surrogate primary key and no unique index on the composite key, the ETL application must be careful to ensure row uniqueness outside the database. A third possibility for the fact table is to define no primary key at all. While there are systems for which this is the most effective approach, it is not good database practice and should be considered with caution.
• Contains no meaningful business information; its only purpose is to uniquely identify each row. There is one exception: the primary key for a time dimension table provides human-readable information in the format "yyyymmdd" (see the sketch following this list).
• Is a simple key on a single column, not a composite key.
• Should be numeric, preferably integer, and not text.
• Should never be a GUID.
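The following sketch illustrates the yyyymmdd exception noted in the list above: deriving an integer time dimension key from a datetime value. The date used is arbitrary.

--Style 112 of CONVERT produces a 'yyyymmdd' string, which is then cast to int
DECLARE @OrderDate datetime
SET @OrderDate = '20000615'
SELECT CONVERT(int, CONVERT(char(8), @OrderDate, 112)) AS Time_Key --returns 20000615
GO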
ETL Code Examples
Code examples in these sections use the pubs sample database included with SQL Server 2000 to demonstrate various activities performed in ETL systems. The examples illustrate techniques for loading dimension tables in the data warehouse; they do not take into consideration separate procedures that may be required to update OLAP cubes or aggregation tables.
The use of temporary and staging tables in the ETL database allows the data extraction and loading process to be broken up into smaller segments of work that can be individually recovered. The temporary tables allow the source data to be loaded and transformed without impacting the performance of the source system except for what is necessary to extract the data. The staging tables provide a mechanism for data validation and surrogate key generation before loading transformed data into the data warehouse. Transformation, validation, and surrogate key management tasks should never be performed directly on dimension tables in the data warehouse.
The code examples in this chapter are presented as Transact-SQL, in order to communicate to the widest audience. A production ETL system would use DTS to perform this work. A very simple system may use several Execute SQL tasks linked within a package. More complex systems divide units of work into separate packages, and call those subpackages from a master package. For a detailed explanation of how to use DTS to implement the functionality described in this chapter, please see SQL Server Books Online.
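As a sketch of how the Transact-SQL in this chapter might be packaged for DTS, each unit of work can be wrapped in a stored procedure that an Execute SQL task (or a step in a master package) simply executes. The procedure name below is hypothetical, and its body would contain logic such as that shown in Code Example 19.3.

--A hypothetical wrapper: a DTS Execute SQL task would run "EXEC usp_ETL_Load_New_Authors"
CREATE PROCEDURE usp_ETL_Load_New_Authors
AS
SET NOCOUNT ON
--The truncate/reload/insert logic of Code Example 19.3 would be placed here so that
--this unit of work can be scheduled, audited, and recovered independently.
RETURN (0)
GO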
Tables for Code Examples
The examples use the authors table in the pubs database as the source of data. The following three tables are created for use by the code examples.
Table name: Purpose
Authors_Temp: Holds the data imported from the source system.
Authors_Staging: Holds the dimension data while it is being updated. The data for the authors will be updated in this table and then the data will be loaded into the data warehouse dimension table.
Authors_DW: Simulates the Authors dimension table in the data warehouse.

These are key points regarding the structures of these tables:
• There is no difference between the structure of the authors table in the pubs database and the Authors_Temp table in the staging area. This allows for straightforward extraction of data from the source system with minimum impact on source system performance.
• The Authors_Staging table is used to generate the surrogate key (Author_Key column) that is used by the data warehouse. This table is also used to validate any data changes, convert data types, and perform any other transformations necessary to prepare the data for loading into the data warehouse.
• The structure of the Authors_Staging table in the staging area is the same as that of the Authors_DW table in the data warehouse. This allows for straightforward loading of the dimension data from the staging database to the data warehouse. If the dimension table in the data warehouse is small enough, it can be truncated and replaced with data from the staging table. In many data warehouses, dimension tables are too large to be efficiently updated by dropping and reloading them in their entirety. In this case, the tables in both the staging area and data warehouse should contain a datetime column, which can be used to determine which records need to be updated, inserted, or deleted in the data warehouse table (a sketch of this incremental approach follows this list).
• The staging and data warehouse tables are identical after the data is loaded into the data warehouse. This fact can be considered for use in backup strategy planning.
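The following sketch illustrates the incremental approach mentioned in the list above, using the Authors_Staging and Authors_DW tables defined in Code Example 19.1 below. The @LastLoadDate value is a hypothetical placeholder that would normally be read from an audit or control table.

--Apply only staging rows changed since the last warehouse load (sketch only)
DECLARE @LastLoadDate smalldatetime
SET @LastLoadDate = '20000601' --hypothetical; normally read from an audit table
UPDATE Authors_DW
SET au_lname = S.au_lname,
    au_fname = S.au_fname,
    phone = S.phone,
    DateUpdated = S.DateUpdated
FROM Authors_Staging S INNER JOIN Authors_DW D ON S.Author_Key = D.Author_Key
WHERE S.DateUpdated > @LastLoadDate
GO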
Define Example Tables
The following three Transact-SQL statements create the Authors_Temp and Authors_Staging tables, and the Authors_DW table that simulates the Authors dimension table in the data warehouse:

Code Example 19.1
CREATE TABLE [Authors_Temp] (
[au_id] [varchar] (11) PRIMARY KEY CLUSTERED,
[au_lname] [varchar] (40) DEFAULT ('Missing'),
[au_fname] [varchar] (20) DEFAULT ('Missing'),
[phone] [char] (12) DEFAULT ('000-000-0000'),
[address] [varchar] (40) DEFAULT ('Missing'),
[city] [varchar] (20) DEFAULT ('Missing'),
[state] [char] (2) DEFAULT ('XX'),
[zip] [char] (5) DEFAULT ('00000'),
[contract] [bit] NOT NULL DEFAULT (0))
ON [PRIMARY]
GO
CREATE TABLE Authors_Staging (
[Author_Key] int NOT NULL IDENTITY (1,1) PRIMARY KEY CLUSTERED ,
[au_id] varchar (11) NOT NULL ,
[au_lname] varchar (40) NOT NULL DEFAULT ('Missing'),
[au_fname] varchar (20) NOT NULL DEFAULT ('Missing'),
[phone] char (12) NOT NULL DEFAULT ('000-000-0000'),
[address] varchar (40) NULL DEFAULT ('Missing'),
[city] varchar (20) NOT NULL DEFAULT ('Missing'),
[state] char (2) NOT NULL DEFAULT ('XX') ,
[zip] char (5) NOT NULL DEFAULT ('00000') ,
[contract] bit NOT NULL,
[DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
[DateUpdated] smalldatetime NOT NULL DEFAULT (getdate())
)ON [PRIMARY]
GO
CREATE TABLE [Authors_DW] (
[Author_Key] [int] NOT NULL PRIMARY KEY CLUSTERED,
[au_id] [varchar] (11) NOT NULL,
[au_lname] [varchar] (40) NOT NULL DEFAULT ('Missing'),
[au_fname] [varchar] (20) NOT NULL DEFAULT ('Missing'),
[phone] [char] (12) NOT NULL DEFAULT ('000-000-0000'),
[address] [varchar] (40) NULL DEFAULT ('Missing'),
[city] [varchar] (20) NOT NULL DEFAULT ('Missing'),
[state] [char] (2) NOT NULL DEFAULT ('XX'),
[zip] [char] (5) NOT NULL DEFAULT ('00000'),
[contract] [bit] NOT NULL,
[DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
[DateUpdated] smalldatetime NOT NULL DEFAULT (getdate())
) ON [PRIMARY]
GO
Populate Example Tables
The following three Transact-SQL statements populate the temporary, staging, and data warehouse sample tables by loading all of the author records except the record for author Johnson White, which will be inserted later to illustrate a technique for adding records to the data warehouse dimension table:

Code Example 19.2
--Populate the Authors_Temp table with all author records except Johnson White's
INSERT INTO Authors_Temp
SELECT * FROM Authors
WHERE AU_ID <> '172-32-1176'
GO
--Populate the Authors_Staging table from Authors_Temp
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, phone, address, city, state,
zip, contract)
SELECT au_id, au_lname, au_fname, phone, address, city, state, zip, contract
FROM Authors_Temp
GO
--Populate the simulated data warehouse dimension table, Authors_DW
INSERT INTO Authors_DW
SELECT * FROM Authors_Staging
GO
The contents of the three tables now simulate the state following the completion of all previous ETL processing, before the author Johnson White is added to the source data table.

Inserting New Dimension Records

Loading new author records is a relatively simple task. If the extraction method is capable of generating a change set (a set of records that have been altered since the last data extraction) from the source system, we load the change set into the temporary table. If we cannot generate a change set from the source system, we will have to load the entire data set from the source system into the temporary table, even if only a single record has changed.

The following Transact-SQL code demonstrates a simple technique for loading new rows into the Authors dimension. This example assumes that there is a primary key on the source system that we can use and that we do not have a change set.
Code Example 19.3
--Truncate any data that currently exists in the Authors_Temp table
TRUNCATE TABLE Authors_Temp
GO
--Load all of the data from the source system into the Authors_Temp table
INSERT INTO Authors_Temp
SELECT * FROM Authors
GO
--Set a starting value for the Contract field for two records
-- for use by future examples
UPDATE Authors_Temp
SET Contract = 0
WHERE state = 'UT'
GO
--Locate all of the new records that have been added to the source system by
--comparing the new temp table contents to the existing staging table contents
--and add the new records to the staging table
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, phone, address, city, state,
zip, contract)
SELECT T.au_id, T.au_lname, T.au_fname, T.phone, T.address, T.city, T.state, T.zip,
T.contract
FROM Authors_Temp T LEFT OUTER JOIN
Authors_Staging S ON T.au_id = S.au_id
WHERE (S.au_id IS NULL)
GO
--Locate all of the new records that are to be added to the data warehouse
--and insert them into the data warehouse by comparing Authors_Staging to Authors_DW
INSERT INTO Authors_DW (Author_Key, au_id, au_lname, au_fname, phone, address, city,
state, zip, contract,
DateCreated, DateUpdated)
SELECT S.Author_Key, S.au_id, S.au_lname, S.au_fname, S.phone, S.address, S.city,
S.state, S.zip, S.contract,
S.DateCreated, S.DateUpdated
FROM Authors_Staging S LEFT OUTER JOIN
Authors_DW D ON S.au_id = D.au_id
WHERE (D.au_id IS NULL)
GO
Managing Slowly Changing Dimensions
This section describes various techniques for managing slowly changing dimensions in the data warehouse. "Slowly changing dimensions" is the customary term used for dimensions that contain attributes that, when changed, may affect grouping or summarization of historical data. Design approaches to dealing with the issues of slowly changing dimensions are commonly categorized into the following three change types:
• Type 1: Overwrite the dimension record
• Type 2: Add a new dimension record
• Type 3: Create new fields in the dimension record
Type 1 and Type 2 dimension change techniques are used when dimension attributes change in records that already exist in the data warehouse. The techniques for inserting new records into dimensions (discussed earlier in the section "Inserting New Dimension Records") apply to all dimensions regardless of whether changes to dimension attributes are incorporated using Type 1 or Type 2 change techniques.
The code examples in the following sections demonstrate techniques for managing Type 1 and Type 2 dimension changes. The examples have been kept simple to maintain clarity for technique illustration purposes: they assume that all changes for a dimension will be of the same type, whereas, in reality, most dimensions include some attributes that require Type 2 changes and other attributes that can be maintained using Type 1 changes. For example, a retailer may decide that a change in the marital status of a customer should be treated as a Type 2 change, whereas a change of street address for the same customer should be treated as a Type 1 change. Therefore, it is important to document all of the attributes in a dimension and, for each attribute, whether a value change should be applied as a Type 1 or a Type 2 change.
Type 1: Overwrite the Dimension Record
A change to a dimension attribute that is never used for analysis can be managed by simply changing the data to the new value. This type of change is called a Type 1 change. For example, a change to a customer's street address is unlikely to affect any summarized information and the previous street address can be discarded without consequence.

Type 1 dimension changes are straightforward to implement. The following Transact-SQL code demonstrates a simple Type 1 technique for updating existing rows in the Authors dimension. For this example, we will change some data in the Authors_Temp table to simulate changed records received as a result of updates to the authors table in the source database. The value for the Contract field is assumed to be eligible for Type 1 changes in this example. In a later section, the Contract field will be updated using a Type 2 change. The following example assumes that there is a primary key on the source system that we can use and that we do not have a change set:
Code Example 19.4
--Change the Authors_Temp table to simulate updates received from the source system
UPDATE Authors_Temp
SET Contract = 0
WHERE state = 'UT'
GO
--Update the Authors_Staging table with the new values in Authors_Temp
UPDATE Authors_Staging
SET Contract = T.Contract,
DateUpdated = getdate()
FROM Authors_Temp T INNER JOIN Authors_Staging S
ON T.au_id = S.au_id
WHERE T.Contract <> S.Contract
GO
--Update the Author_DW with the new data in Authors_Staging
UPDATE Authors_DW
SET Contract = S.Contract,
DateUpdated = getdate()
FROM Authors_Staging S INNER JOIN Authors_DW D
ON S.Author_Key = D.Author_Key
WHERE S.Contract <> D.Contract
GO
Type 2: Add a New Dimension Record
Type 2 changes cause history to be partitioned at the event that triggered the change. Data prior to the event continues to be summarized and analyzed as before; new data is summarized and analyzed in accordance with the new value of the data. The technique for implementing a Type 2 change is to keep the existing dimension record and add a new record that contains the updated data for the attribute or attributes that have changed. Values are copied from the existing record to the new record for all fields that have not changed. A new surrogate key value is created for the new record and the record is added to the dimension table. Fact records that apply to events subsequent to the Type 2 change must be related to the new dimension record.

Although it is relatively straightforward to implement Type 2 change techniques in the ETL process to manage slowly changing dimensions, the data associated with a dimension member becomes fragmented as such changes are made. Data warehouse analysis and reporting tools must be capable of summarizing data correctly for dimensions that include Type 2 changes. To minimize unnecessary fragmentation, a Type 2 change should not be used if a Type 1 change is appropriate.
The techniques used to insert new records into a Type 2 dimension are the same as the ones used to insert new records into a Type 1 dimension. However, the techniques used to track updates to dimension records are different.
The following Transact-SQL code demonstrates a simple technique for applying Type 2 changes for existing rows in the Authors dimension. Unlike a Type 1 change, existing records are not updated in a Type 2 dimension. Instead, new records are added to the dimension to contain the changes to the source system records. In this example, the values of the contract field changed in the Type 1 example are changed to different values and we now assume the contract field is to be managed as a Type 2 change.
Notice that the Transact-SQL statement used to load updated records into the staging table is the same as the one used to insert new records into the staging table, except that the predicate clause in the two statements differs. When loading new records, the WHERE clause uses the au_id field to determine which records are new. When inserting records for Type 2 changes, the WHERE clause causes new records to be added when the value of the attribute of interest (contract) in the temporary table differs from the attribute value in the staging table.
Code Example 19.5
--Change the Authors_Temp table to simulate updates received from the source system
--This change reverses the change made in the Type 1 example by setting Contract to 1
UPDATE Authors_Temp
SET Contract = 1
WHERE state = 'UT'
GO
--For example purposes, make sure the staging table records have a different value
--for the contract field for the UT authors
UPDATE Authors_Staging
SET Contract = 0
WHERE state = 'UT'
GO
--Insert new records into the Staging Table for those records in the temp table
--that have a different value for the contract field
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, phone, address, city, state,
zip, contract)
SELECT T.au_id, T.au_lname, T.au_fname, T.phone, T.address, T.city, T.state, T.zip,
T.contract
FROM Authors_Temp T
LEFT OUTER JOIN Authors_Staging S ON T.au_id = S.au_id
WHERE T.Contract <> S.Contract
GO
--Insert the new records into the data warehouse Table
INSERT INTO Authors_DW (Author_Key, au_id, au_lname, au_fname, phone, address, city,
state, zip, contract,
DateCreated, DateUpdated)
SELECT S.Author_Key, S.au_id, S.au_lname, S.au_fname, S.phone, S.address, S.city,
S.state, S.zip, S.contract,
S.DateCreated, S.DateUpdated
FROM Authors_Staging S LEFT OUTER JOIN
Authors_DW D ON S.Author_Key = D.Author_Key
WHERE (D.Author_Key IS NULL)
GO
Managing the Fact Table
After all dimension records have been loaded and updated, the fact table also must be loaded with new data. The fact table must be loaded after the dimension tables so the surrogate keys added to the dimension records during the ETL processes can be used as foreign keys in the fact table. This section demonstrates techniques for loading the fact table.

For purposes of these examples, a table (Fact_Source) is created that simulates a data table in a source system from which fact data can be extracted. The Fact_Source table data is a combination of data found in the Sales and TitleAuthor tables in the pubs database.
The following table lists definitions of the tables created for use with the examples that follow.
Table name: Purpose
Fact_Source: A simulated source data table that will be used in the example code. This table is a combination of Sales and TitleAuthor in the pubs database.
Fact_Temp: Receives data imported from the source system.
Fact_Staging: Holds the fact table data during transformation and surrogate key operations. Data is loaded to the data warehouse fact table after ETL operations are complete.
Fact_DW: The fact table in the data warehouse.
Titles_DW: A dimension table for Titles to demonstrate the use of surrogate keys in the fact table.
Stores_DW: A dimension table for Stores to demonstrate the use of surrogate keys in the fact table.

Several key points about the structures of these tables should be noted:
• There is no difference between the structures of the Fact_Source and Fact_Temp tables. This allows for the simplest method of extracting data from the source system, so that transformations on the data do not impact the source system.
• The Fact_Staging table is used to add the dimension surrogate keys to the fact table records. This table is also used to validate any data changes, convert any data types, and so on.
• The structures of the Fact_Staging and Fact_DW tables do not match. This is because the final fact table in the data warehouse does not store the original keys, just the surrogate keys.
• The fact table key is an identity column that is generated when the transformed data is loaded into the fact table. Because we will not be updating the records once they have been added to the Fact_DW table, there is no need to generate the key prior to the data load into the fact table. This is not how the key column is generated in dimension tables. As discussed above, the decision to use an identity key for a fact table depends on the complexity of the data warehouse schema and the performance of load and query operations; this example implements an identity key for the fact table.
Code Example 19.6
--Create the simulated source data table
CREATE TABLE [Fact_Source] (
[stor_id] [char] (4) NOT NULL ,
[ord_num] [varchar] (20) NOT NULL ,
[ord_date] [datetime] NOT NULL ,
[qty] [smallint] NOT NULL ,
[payterms] [varchar] (12) NOT NULL ,
[title_id] [tid] NOT NULL
) ON [PRIMARY]
GO
--Create the example temporary source data table used in the ETL database
CREATE TABLE [Fact_Temp] (
[stor_id] [char] (4) NOT NULL ,
[ord_num] [varchar] (20) NOT NULL ,
[ord_date] [datetime] NOT NULL ,
[qty] [smallint] NOT NULL ,
[payterms] [varchar] (12) NOT NULL ,
[title_id] [tid] NOT NULL
) ON [PRIMARY]
GO
--Create the example fact staging table
CREATE TABLE [Fact_Staging] (
[stor_id] [char] (4) NOT NULL ,
[ord_num] [varchar] (20) NOT NULL ,
[ord_date] [datetime] NOT NULL ,
[qty] [smallint] NOT NULL DEFAULT (0),
[payterms] [varchar] (12) NOT NULL ,
[title_id] [tid] NOT NULL,
[Store_Key] [int] NOT NULL DEFAULT (0),
[Title_Key] [int] NOT NULL DEFAULT (0)
) ON [PRIMARY]
GO
--Create the example data warehouse fact table
CREATE TABLE [Fact_DW] (
[Store_Key] [int] NOT NULL DEFAULT (0),
[Title_Key] [int] NOT NULL DEFAULT (0),
[ord_num] [varchar] (20) NOT NULL ,
[ord_date] [datetime] NOT NULL ,
[qty] [smallint] NOT NULL DEFAULT (0),
[payterms] [varchar] (12) NOT NULL ,
[Fact_Key] [int] IDENTITY (1, 1) NOT NULL PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO
--Create the example titles dimension table
CREATE TABLE [Titles_DW] (
[title_id] [tid] NOT NULL,
[title] [varchar] (80) NOT NULL,
[type] [char] (12) DEFAULT ('UNDECIDED'),
[pub_id] [char] (4) NULL,
[price] [money] NULL,
[advance] [money] NULL,
[royalty] [int] NULL,
[ytd_sales] [int] NULL,
[notes] [varchar] (200) NULL,
[pubdate] [datetime] NOT NULL DEFAULT (getdate()),
[Title_Key] [int] NOT NULL IDENTITY (1,1) PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO
--Create the example stores dimension table
CREATE TABLE [Stores_DW] (
[stor_id] [char] (4) NOT NULL,
[stor_name] [varchar] (40) NULL,
[stor_address] [varchar] (40) NULL,
[city] [varchar] (20) NULL,
[state] [char] (2) NULL,
[zip] [char] (5) NULL,
[Store_Key] [int] IDENTITY (1,1) PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO
The following statements populate the sample fact and dimension tables and provide a base set of data for the remainder of the examples. The Fact_Temp and Fact_Source tables may appear redundant, but Fact_Source is used only to simulate a source table in an OLTP system.

Code Example 19.7
--Load the simulated Fact_Source table with example data
INSERT INTO Fact_Source
SELECT S.*
FROM titleauthor TA INNER JOIN sales S ON TA.title_id = S.title_id
GO
--Load the Fact_Temp table with data from the Fact_Source table
INSERT INTO Fact_Temp
SELECT *
FROM Fact_Source
GO
--Load the example dimension for Titles
INSERT INTO Titles_DW
SELECT *
FROM Titles
GO
--Load the example dimension for Stores
INSERT INTO Stores_DW
SELECT *
FROM Stores
GO
This completes the preparation of the sample data. The remaining examples demonstrate the tasks that prepare data for loading and load it into the data warehouse fact table.

The following code loads the Fact_Staging table. Notice that the Store_Key and Title_Key columns that are used for surrogate keys contain zeros when the data is first loaded into the staging table. This is because NULLs are not allowed in these columns. Preventing NULLs allows for a very clean data load and removes the need for NULL logic checks in the ETL code or the reporting system. The zeros in these columns also provide an easy mechanism for locating invalid data in the dimension and fact table data: if a zero appears in either column in the final fact table, the ETL logic failed to handle a dimension attribute. It is good practice to always add a dimension record with a surrogate key of zero and a description of "unknown." This preserves relational integrity in the data warehouse and allows reporting systems to display the invalid data, so that corrections can be made to the ETL logic or the source data.
Code Example 19.8
--Load the Fact_Staging table with data in the Fact_Temp table
INSERT INTO Fact_Staging (stor_id, ord_num, ord_date, qty, payterms, title_id,
Store_Key, Title_Key)
SELECT stor_id, ord_num, ord_date, qty, payterms, title_id, 0, 0
FROM Fact_Temp
GO
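As a sketch of the "unknown" member practice described above, the following adds a zero-key row to the Stores_DW dimension. The stor_id value '0000' and the name 'Unknown' are illustrative assumptions, not part of the chapter's numbered examples.

--Seed the Stores dimension with an "unknown" member whose surrogate key is 0
SET IDENTITY_INSERT Stores_DW ON
INSERT INTO Stores_DW (stor_id, stor_name, Store_Key)
VALUES ('0000', 'Unknown', 0)
SET IDENTITY_INSERT Stores_DW OFF
GO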
Now that the Fact_Staging table is loaded, the surrogate keys can be updated. The techniques for updating the surrogate keys in the fact table will differ depending on whether the dimension contains Type 2 changes. The following technique can be used for Type 1 dimensions:

Code Example 19.9
--Update the Fact_Staging table with the surrogate key for Titles
--(Type 1 dimension)
UPDATE Fact_Staging
SET Title_Key = T.Title_Key
FROM Fact_Staging F INNER JOIN
Titles_DW T ON F.title_id = T.title_id
GO
--Update the Fact_Staging table with the surrogate key for Store
--(Type 1 dimension)
UPDATE Fact_Staging
SET Store_Key = S.Store_Key
FROM Fact_Staging F INNER JOIN
Stores_DW S ON F.Stor_id = S.Stor_id
GO
The technique above will not work for dimensions that contain Type 2 changes, however, because there may be more than one dimension record that contains the original source key. The following technique is appropriate for Type 2 dimensions:

Code Example 19.10
--Add a few new rows to the Stores_DW table to demonstrate technique
--Duplicate Store records are added that reflect changed store names
INSERT INTO Stores_DW (stor_id, stor_name, stor_address, city, state, zip)
SELECT stor_id, 'New ' + stor_name, stor_address, city, state, zip
FROM Stores_DW
WHERE state = 'WA'
GO
--Add some new rows to fact table to demonstrate technique
INSERT INTO Fact_Staging (stor_id, ord_num, ord_date, qty, payterms, title_id,
Store_Key, Title_Key)
SELECT stor_id, ord_num, ord_date, qty, payterms, title_id, 0, 0
FROM Fact_Temp
GO
--Update the fact table. Use the maximum store key
--to relate the new fact data to the latest store record.
BEGIN TRANSACTION
--get the maximum store_key for each stor_id
SELECT MAX(STORE_KEY) AS Store_Key, stor_id
INTO #Stores
FROM Stores_DW
GROUP BY stor_id
ORDER BY stor_id
--update the fact table
UPDATE Fact_Staging
SET Store_Key = S.Store_Key
FROM Fact_Staging F INNER JOIN
#Stores S ON F.stor_id = S.stor_id
WHERE F.Store_Key = 0
--drop the temporary table
DROP TABLE #Stores
COMMIT TRANSACTION
GO
After the fact data has been successfully scrubbed and transformed, it needs to be loaded into the data warehouse. If the ETL database is not on the same server as the data warehouse database, then the data will need to be transferred using DTS, bcp, or another mechanism. An efficient approach is to use bcp to export the data from the ETL database, copy the data to the target server, and then use BULK INSERT to update the target database. However, if the databases are on the same server, a simple INSERT statement will load the new fact table rows:

Code Example 19.11
--Load the new fact table rows into the data warehouse
INSERT INTO Fact_DW (ord_num, ord_date, qty, payterms, Store_Key, Title_Key)
SELECT ord_num, ord_date, qty, payterms, Store_Key, Title_Key
FROM Fact_Staging
GO
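If the staging and data warehouse databases are on different servers, the bcp and BULK INSERT approach mentioned above might look like the following sketch. The file path, server names, database names, and the Fact_Load holding table (assumed to have the same structure as Fact_Staging) are illustrative assumptions.

--1. Export the transformed rows on the staging server (run from a command prompt):
--   bcp "ETL_Staging..Fact_Staging" out "D:\Transfer\Fact_Staging.dat" -n -S StagingServer -T
--2. Copy the file to the data warehouse server.
--3. Bulk load the file into a holding table, then insert into the fact table:
BULK INSERT Warehouse..Fact_Load
FROM 'D:\Transfer\Fact_Staging.dat'
WITH (DATAFILETYPE = 'native', TABLOCK)
GO
INSERT INTO Fact_DW (ord_num, ord_date, qty, payterms, Store_Key, Title_Key)
SELECT ord_num, ord_date, qty, payterms, Store_Key, Title_Key
FROM Fact_Load
GO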
Finally, the following SELECT statement shows the data warehouse fact table, complete with the Type 2 changes applied to the Stores dimension:

Code Example 19.12
--Demonstrate the success of the technique
SELECT S.stor_id, S.Store_Key, S.stor_name, F.ord_num, F.ord_date, F.qty, F.payterms
FROM Stores_DW S INNER JOIN
Fact_DW F ON S.Store_Key = F.Store_Key
ORDER BY S.stor_id, S.Store_Key
GO
Advanced Techniques
While the sample techniques described above will work for small to medium-sized dimensions, they will not work for large dimensions. For large dimensions, variations of these techniques can provide greater efficiency. The code examples in this topic show some advanced techniques for Type 2 dimensions.

One of the key design decisions in the above techniques centers on the use of the staging tables. In the techniques illustrated above, the staging tables are exact copies of the final data warehouse dimension tables. However, the efficiency of the above techniques decreases as the number of rows in the dimension increases due to records added for Type 2 changes. For very large dimensions (millions of rows), the above technique will require massive amounts of processing power to complete. Therefore, for large dimensions, we need to introduce a variation of the above technique that will allow the system to scale with the data warehouse.
This variation involves creating a "current version" dimension table for use in the ETL process that contains only a single row for each of the dimension members. This record contains the current attributes of the dimension member. For example, if we have a Type 2 dimension for stores, and the data for the store Bookbeat has undergone three Type 2 changes, then the current version table would not contain all four records for the store. Instead, the table contains a single row for Bookbeat that contains all of the current information for it, including the current surrogate key value for the dimension member. This creates a smaller table with fewer rows that allows for faster access during the ETL process.
The following code incorporates a Store_Current table to demonstrate this technique for the Stores dimension. The table below describes each of the tables used in the example.
Table name: Purpose
Stores_Temp: Holds the data imported from the source system.
Stores_Staging: Holds the dimension data while it is being updated. The data for the stores will be updated in this table and then the data will be loaded into the data warehouse dimension table.
Stores_Current: Contains a single record for each store to track the current information for the store.
Stores_DW: Simulates the Stores dimension table in the data warehouse.

The following statements create the four tables:
Code Example 19.13

DROP TABLE Stores_DW
GO
CREATE TABLE [Stores_Temp] (
[stor_id] [char] (4) NOT NULL,
[stor_name] [varchar] (40) NULL,
[stor_address] [varchar] (40) NULL,
[city] [varchar] (20) NULL,
[state] [char] (2) NULL,
[zip] [char] (5) NULL
) ON [PRIMARY]
GO
CREATE TABLE [Stores_Staging] (
[stor_id] [char] (4) NOT NULL,
[stor_name] [varchar] (40) NULL,
[stor_address] [varchar] (40) NULL,
[city] [varchar] (20) NULL,
[state] [char] (2) NULL,
[zip] [char] (5) NULL,
[DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
[DateUpdated] smalldatetime NOT NULL DEFAULT (getdate()),
[Store_Key] [int] IDENTITY (1,1) PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO
CREATE TABLE [Stores_Current] (
[stor_id] [char] (4) NOT NULL,
[stor_name] [varchar] (40) NULL,
[stor_address] [varchar] (40) NULL,
[city] [varchar] (20) NULL,
[state] [char] (2) NULL,
[zip] [char] (5) NULL,
[DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
[DateUpdated] smalldatetime NOT NULL DEFAULT (getdate()),
[Store_Key] [int] PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO
CREATE TABLE [Stores_DW] (
[stor_id] [char] (4) NOT NULL,
[stor_name] [varchar] (40) NULL,
[stor_address] [varchar] (40) NULL,
[city] [varchar] (20) NULL,
[state] [char] (2) NULL,
[zip] [char] (5) NULL,
[DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
[DateUpdated] smalldatetime NOT NULL DEFAULT (getdate()),
[Store_Key] [int] PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO
The following statements populate the Stores_Temp, Stores_Staging, and Stores_Current sample tables to provide a base set of data that will be used in the remainder of the example:

Code Example 19.14
--Load the Stores_Temp table with the default set of data
INSERT INTO Stores_Temp
SELECT * FROM Stores
GO
--Load the Stores_Staging table with the default set of data
INSERT INTO Stores_Staging (stor_id, stor_name, stor_address, city, state, zip,
DateCreated, DateUpdated)
SELECT stor_id, stor_name, stor_address, city, state, zip, getdate(), getdate()
FROM Stores_Temp
GO
--Load the Stores_Current table with the default set of data
INSERT INTO Stores_Current (stor_id, stor_name, stor_address, city, state, zip,
DateCreated, DateUpdated, store_key)
SELECT stor_id, stor_name, stor_address, city, state, zip, DateCreated, DateUpdated,
store_key
FROM Stores_Staging
GO
The following code adds some new records into the Stores_Staging table to simulate Type 2 changes to the Stores dimension. The new records reflect changes to existing store data; no new store records are added.

Code Example 19.15
--Insert some change records into Store_Staging to demonstrate the technique
--Duplicate records are added that reflect changes to store names for some stores
INSERT INTO Stores_Staging (stor_id, stor_name, stor_address, city, state, zip)
SELECT stor_id, stor_name + ' New', stor_address, city, state, zip
FROM Stores_staging
WHERE state <> 'ca'
GO
Records for new stores are loaded into Stores_Current before starting to process stores with change records. The following Transact-SQL code loads new stores in the Stores_Staging table into the Stores_Current table. This technique is the same as the one documented earlier in the chapter.

Code Example 19.16
--Insert any new stores in Stores_Staging into the Stores_Current table
--In this example there should not be any new stores
INSERT INTO Stores_Current (stor_id, stor_name, stor_address, city, state, zip,
DateCreated, DateUpdated)
SELECT S.stor_id, S.stor_name, S.stor_address, S.city, S.state, S.zip, S.DateCreated,
S.DateUpdated
FROM Stores_Staging S LEFT OUTER JOIN Stores_Current C ON S.stor_id = C.stor_id
WHERE (c.Store_Key IS NULL)
GO
The real change in this technique involves changing the way that dimension members are updated. The following Transact-SQL code demonstrates the alternative way to update the dimension members and load them into the data warehouse dimension. Once the new members of the dimension have been loaded, the next step is to check existing members for attribute changes that require Type 2 changes to the dimension. This example checks the stor_name attribute and updates the row in the Stores_Current table for every store that has had a name change (in this example, all stores that are not in CA).

Code Example 19.17
--Update Store_Current table for all stores that have had a name change
UPDATE Stores_Current
SET stor_name = S.stor_name,
Store_key = S.Store_Key,
DateUpdated = getdate()
FROM Stores_Staging S
INNER JOIN Stores_Current C ON S.stor_id = C.stor_id
WHERE S.stor_name <> C.stor_name
GO
Now that all of the dimension records have been updated with the latest data, the surrogate keys can be updated for the fact table data with the following Transact-SQL statement. This technique is more efficient because a temporary table does not have to be created to determine the current value of the dimension table key.

Code Example 19.18
--generate some fact data rows that do not have a store_key
INSERT INTO Fact_Staging (stor_id, ord_num, ord_date, qty, payterms, title_id,
store_key, title_key)
SELECT stor_id, ord_num, getdate(), qty, payterms, title_id, 0, 0
FROM Fact_Staging
WHERE QTY < 20
GO
--Update the fact data using the Store_Key key from the Store_Current table
--to relate the new fact data to the latest store record
UPDATE Fact_Staging
SET Store_Key = C.Store_Key
FROM Fact_Staging F INNER JOIN
Stores_Current C ON F.stor_id = C.stor_id
WHERE F.Store_Key = 0
GO
Meta Data Logging
A critical design element in successful ETL implementation is the capability to generate, store, and review meta data. Data tables in a data warehouse store information about customers, items purchased, dates of purchase, and so on. Meta data tables store information about users, query execution times, the number of rows retrieved in a report, and so on. In ETL systems, meta data tables store information about transformation execution time, the number of rows processed by a transformation, the last date and time a table was updated, the failure of a transformation to complete, and so on. This information, if analyzed appropriately, can help predict what is likely to occur in future transformations by analyzing trends of what has already occurred.

In the code examples that follow, the terms "Job" and "Step" are used with the following meanings:
• A "Job" is an ETL element that is executed either manually or as a scheduled event. A Job contains one or more steps.
• A "Step" is an individual unit of work in a job, such as an INSERT, UPDATE, or DELETE operation.
• A "Threshold" is a range of values defined by a minimum value and a maximum value. Any value that falls within the specified range is deemed acceptable; any value that does not fall within the range is unacceptable. For example, a processing window is a type of threshold: if a job completes within the time allotted for the processing window, it is acceptable; if it does not, it is not acceptable.
Job Audit
ETL jobs produce data points that need to be collected. Most of these data points are aggregates of the data collected for the job steps and could theoretically be derived by querying the job step audit table. However, the meta data for the job itself is important enough to warrant storage in a separate table. Below are sample meta data tables that aid in tracking job information for each step in an ETL process.

tblAdmin_Job_Master
This table lists all of the jobs that are used to populate the data warehouse. These are the fields in tblAdmin_Job_Master:
Field: Definition
JobNumber: A unique identifier for the record, generally an identity column.
JobName: The name (description) for the job. For example, "Load new dimension data."
MinThreshRecords: The minimum acceptable number of records affected by the job.
MaxThreshRecords: The maximum acceptable number of records affected by the job.
MinThreshTime: The minimum acceptable execution time for the job.
MaxThreshTime: The maximum acceptable execution time for the job.
CreateDate: The date and time the record was created.
tblAdmin_Audit_Jobs

This table is used to track each specific execution of a job. It is related to the tblAdmin_Job_Master table using the JobNumber column. These are the fields in tblAdmin_Audit_Jobs:
Field: Definition
JobAuditID: A unique identifier for the specific execution of the job, generally an identity column.
JobNumber: The number of the job being executed, from tblAdmin_Job_Master.
JobName: The name (description) for the job. For example, "Load new dimension data."
StartDate: The date and time the job was started.
EndDate: The date and time the job ended.
NumberRecords: The number of records affected by the job.
Successful: A flag indicating whether the execution of the job was successful.
This data definition language will generate the above audit tables:

Code Example 19.19

CREATE TABLE [dbo].[tblAdmin_Job_Master] (
[JobNumber] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_Job_Master PRIMARY KEY CLUSTERED,
[JobName] [varchar] (50) NULL DEFAULT ('Missing'),
[MinThreshRecords] [int] NOT NULL DEFAULT (0),
[MaxThreshRecords] [int] NOT NULL DEFAULT (0),
[MinThreshTime] [int] NOT NULL DEFAULT (0),
[MaxThreshTime] [int] NOT NULL DEFAULT (0),
[CreateDate] [smalldatetime] NOT NULL DEFAULT (getdate()))
GO
CREATE TABLE [dbo].[tblAdmin_Audit_Jobs] (
[JobAuditID] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_Audit_Jobs PRIMARY KEY CLUSTERED,
[JobNumber] [int] NOT NULL DEFAULT (1),
[JobName] [varchar] (50) NULL DEFAULT ('Missing'),
[StartDate] [smalldatetime] NOT NULL DEFAULT (getdate()),
[EndDate] [smalldatetime] NOT NULL DEFAULT ('01/01/1900'),
[NumberRecords] [int] NOT NULL DEFAULT (0),
[Successful] [bit] NOT NULL DEFAULT (0))
GO
Step Audit
Many ETL jobs are multi-step, complicated transformations that involve INSERT, UPDATE, and DELETE statements with branching logic and an execution dependency. It is important to record meta data that tracks the successful completion of an operation, when it happened, and how many rows it processed. This information should be stored for every step in an ETL job. Below are sample meta data tables that aid in tracking information for each step in an ETL job.

tblAdmin_Step_Master
This table lists all of the steps in a job. These are the fields in tblAdmin_Step_Master:
Field: Definition
JobNumber: The unique number of the job that this step is associated with.
StepSeqNumber: The step number within the object that executed the unit of work. Frequently, ETL jobs contain more than a single unit of work, and storing the step number allows for easy debugging and specific reporting. If the object has only a single step, the value of this field is "1".
StepDescription: A description of the step. For example, "Inserted records into tblA."
Object: The name of the object. For example, the name of a stored procedure or DTS package that accomplishes the step.
NumberRecords: The number of records affected by the step.
MinThreshRecords: The minimum "acceptable" number of records affected by the step.
MaxThreshRecords: The maximum "acceptable" number of records affected by the step.
MinThreshTime: The minimum "acceptable" execution time for the step.
MaxThreshTime: The maximum "acceptable" execution time for the step.
CreateDate: The date and time the record was created.
StepNumber: A unique value assigned to the record, generally an identity column.
tblAdmin_Audit_Step

This table is used to track each specific execution of a job step. It is related to the tblAdmin_Step_Master table using the StepNumber column. These are the fields in tblAdmin_Audit_Step:
Field: Definition
RecordID: A unique value assigned to the record, generally an identity column.
JobAuditID: Used to tie the specific execution of a job step to the specific execution of a job.
StepNumber: The step number executed.
Parameters: Any parameters sent to the job step for the specific execution instance. These are the parameter values, not a list of the parameters.
NumberRecords: The number of records affected by the job step.
StartDate: The date and time the job step was started.
EndDate: The date and time the job step ended.
UserName: The name of the user that executed the job step.
Below is the data definition language that will generate the job step audit tables above.

Code Example 19.20

CREATE TABLE [dbo].[tblAdmin_Step_Master] (
[JobNumber] [int] NOT NULL DEFAULT (1),
[StepSeqNumber] [int] NOT NULL DEFAULT (1),
[StepDescription] [varchar] (100) NOT NULL DEFAULT ('Missing'),
[Object] [varchar] (50) NULL DEFAULT ('Missing'),
[MinThreshRecords] [int] NOT NULL DEFAULT (0),
[MaxThreshRecords] [int] NOT NULL DEFAULT (0),
[MinThreshTime] [int] NOT NULL DEFAULT (0),
[MaxThreshTime] [int] NOT NULL DEFAULT (0),
[StartDate] [smalldatetime] NOT NULL DEFAULT (getdate()),
[StepNumber] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_Step_Master PRIMARY KEY CLUSTERED) ON [PRIMARY]
GO
CREATE TABLE [dbo].[tblAdmin_Audit_Step] (
[RecordID] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_Audit_Step PRIMARY KEY CLUSTERED,
[JobAuditID] [int] NOT NULL DEFAULT (0),
[Object] [varchar] (50) NULL DEFAULT ('Missing'),
[StepNumber] [tinyint] NOT NULL DEFAULT (1),
[StepDescription] [varchar] (100) NOT NULL DEFAULT ('Missing'),
[Parameters] [varchar] (100) NULL,
[NumberRecords] [int] NOT NULL DEFAULT (1),
--Threshold and target columns are populated by usp_Admin_Audit_Job_End
[MinRecords] [int] NULL,
[MaxRecords] [int] NULL,
[MinTime] [int] NULL,
[MaxTime] [int] NULL,
[RecordTarget] [varchar] (10) NULL,
[TimeTarget] [varchar] (10) NULL,
[StartDate] [smalldatetime] NOT NULL DEFAULT (getdate()),
[EndDate] [smalldatetime] NOT NULL DEFAULT ('01/01/1900'),
[UserName] [varchar] (20) NOT NULL DEFAULT ('Missing')) ON [PRIMARY]
GO
Error Tracking
Another important type of meta data about transformations is information that tracks what failed and why. ETL jobs produce errors. Just as tracking successful execution is important, tracking failures is equally important. Below is a sample meta data table that aids in tracking error information for each step in an ETL job. This table is designed to track SQL Server 2000 errors, although it could be modified to track OLE DB errors as well.

Note: In SQL Server 2000, only the error number can be trapped, not the generated error message.
tblAdmin_Audit_Errors
This table lists all of the errors that are generated during a job step. These are the fields in tblAdmin_Audit_Errors:
Field: Definition
RecordID: A unique value assigned to the record, generally an identity column.
Object: The name of the object (for example, the stored procedure) in which the error occurred.
StepNumber: The step number executed that generated the error.
Parameters: Any parameters sent to the job step for the specific execution instance. These are the parameter values, not a list of the parameters.
ErrorNumber: The error number raised by SQL Server 2000 (generally the @@ERROR number).
RecordCount: The number of records affected by the job step.
UserName: The name of the user that executed the job step.
RecordDate: The date and time the error record was created.
Code Sample: Job Audit
The following stored procedures demonstrate one method of logging job-level meta data. The usp_Admin_Audit_Job_Start procedure indicates the start of an ETL job and should be the very first stored procedure executed in the ETL job:

Code Example 19.21
CREATE PROCEDURE usp_Admin_Audit_Job_Start
@JobNumber int = 1 --The number of the job being executed (from the master job table)
AS
SET NOCOUNT ON --SET NoCount ON
--DECLARE variables
DECLARE @ErrorNumber int --the number of the SQL error generated
DECLARE @ErrorRowCount int --the number of rows in the unit of work affected by error
DECLARE @Startdate smalldatetime --the datetime the load job started
DECLARE @EndDate smalldatetime --the datetime the load job ended
--INSERT the first record (start time) for the job into the tblAdmin_Audit_Jobs table
BEGIN TRANSACTION
SET @StartDate = getdate() --set a start date for the batch
SET @EndDate = '01/01/1900' --set a placeholder end date for the batch
insert into tblAdmin_Audit_Jobs (JobNumber, StartDate, EndDate, NumberRecords,
Successful)
values (@JobNumber, @StartDate, @EndDate, 0, 0)
SELECT @ErrorNumber = @@error, @ErrorRowCount = @@rowcount
If @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END
COMMIT TRANSACTION
RETURN (0)
Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 0, 'none', @ErrorNumber, @ErrorRowCount --step 0 denotes the job-level procedure
RETURN (1)
GO
The following stored procedure indicates the end of an ETL job and should be the last stored procedure executed in the ETL job. It is important to note that in addition to updating the tblAdmin_Audit_Jobs table, this stored procedure also updates the tblAdmin_Audit_Step table with the threshold information for each step. The threshold information is stored with each step in the table because, over time, the acceptable thresholds for the step may change. If the threshold information is only stored in the master step table (a Type 1 dimension), any changes to the table affect meta data generated for historical steps. Therefore, storing the threshold with the step (a Type 2 dimension) allows us to maintain historical execution records without affecting their integrity if the master step information is changed. For example, if a step initially loads 1,000 rows but over time the number of rows increases to 1 million, the acceptable threshold information for that step must be changed as well. If the threshold data is stored only in the tblAdmin_Step_Master table and not stored with each record, the context of the data will be lost, which can cause inaccuracies in reports built on the meta data information. For simplicity, to illustrate the technique, the sample code does not maintain threshold information automatically. In order to change the threshold information for a step, an administrator will need to modify the master step record manually. However, it would be possible to automate this process (a sketch follows Code Example 19.22).
Code Example 19.22
CREATE PROCEDURE usp_Admin_Audit_Job_End
@JobNumber int = 1, --The number of the job (from the master job table) being executed
@Successful bit --A flag indicating if the job was successful
AS
SET NOCOUNT ON --SET NoCount ON
--DECLARE variables
DECLARE @ErrorNumber int --the number of the SQL error generated
DECLARE @ErrorRowCount int --the number of rows in the unit of work affected by error
DECLARE @Startdate smalldatetime --the datetime the load job started
DECLARE @EndDate smalldatetime --the datetime the load job ended
DECLARE @JobAuditID int --the # for the instance of the job
DECLARE @RowCount int --the number of rows affected by the job
--UPDATE the job record (end time) in the Audit table
BEGIN TRANSACTION
SET @EndDate = getdate() --set the end date for the batch
SET @JobAuditID = (SELECT MAX(JobAuditID) FROM tblAdmin_Audit_Jobs
where JobNumber = @JobNumber) --get the job number
SET @RowCount = (SELECT SUM(NumberRecords) --get the total job record count
FROM tblAdmin_Audit_Step WHERE JobAuditID = @JobAuditID)
UPDATE tblAdmin_Audit_Jobs --Update the Job record with the end time
SET EndDate = @EndDate,
NumberRecords = @RowCount,
Successful = 1
WHERE JobAuditID = @JobAuditID
SELECT @ErrorNumber = @@error, @ErrorRowCount = @@rowcount
UPDATE tblAdmin_Audit_Step --Update all steps for the job with the
SET MinRecords = T.MinThreshRecords, --threshold information for each step
MaxRecords = T.MaxThreshRecords,
MinTime = T.MinThreshTime,
MaxTime = T.MaxThreshTime,
TimeTarget = CASE
WHEN DATEDIFF(mi, A.StartDate, A.EndDate) BETWEEN T.MinThreshTime AND T.MaxThreshTime
THEN 'On'
WHEN DATEDIFF(mi, A.StartDate, A.EndDate)< T.MinThreshTime THEN 'Under'
WHEN DATEDIFF(mi, A.StartDate, A.EndDate) > T.MaxThreshTime THEN 'Over'
ELSE 'Unknown'
END,
RecordTarget = CASE
WHEN A.NumberRecords BETWEEN T.MinThreshRecords AND T.MaxThreshRecords THEN 'On'
WHEN A.NumberRecords < T.MinThreshRecords THEN 'Under'
WHEN A.NumberRecords > T.MaxThreshRecords THEN 'Over'
ELSE 'Unknown'
END
FROM tblAdmin_Step_Master T
RIGHT OUTER JOIN tblAdmin_Audit_Step A ON T.StepNumber = A.StepNumber
WHERE A.JobAuditID = @JobAuditID
SELECT @ErrorNumber = @@error, @ErrorRowCount = @@rowcount
If @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END
COMMIT TRANSACTION
RETURN (0)
Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 0, 'none', @ErrorNumber, @ErrorRowCount --step 0 denotes the job-level procedure
RETURN (1)
GO
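As noted above, threshold maintenance could be automated instead of being adjusted manually. The following is a sketch of one possible approach; the 20 percent band around the average observed row count is an arbitrary assumption, not part of the chapter's numbered examples.

--Sketch only: recompute record-count thresholds for each step as +/- 20 percent
--of the average row count observed in prior executions
UPDATE tblAdmin_Step_Master
SET MinThreshRecords = CONVERT(int, A.AvgRecords * 0.8),
    MaxThreshRecords = CONVERT(int, A.AvgRecords * 1.2)
FROM tblAdmin_Step_Master M INNER JOIN
    (SELECT StepNumber, AVG(NumberRecords) AS AvgRecords
     FROM tblAdmin_Audit_Step
     GROUP BY StepNumber) A ON M.StepNumber = A.StepNumber
GO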
Code Sample: Step Audit
The following stored procedures demonstrate one method of logging step records from within ETL stored procedures. Notice that @@ProcID is used to retrieve the object ID of the executing stored procedure. Also note that the values of @@error and @@rowcount are retrieved immediately after the INSERT statement.

Code Example 19.23
CREATE PROCEDURE usp_Admin_Audit_Step
@StepNumber tinyint = 0, --the unique number of the step
@Parameters varchar(50) = 'none', --any parameters used in the SP
@RecordCount int = 0, --the number of records modified by the step
@StartDate smalldatetime, --the date & time the step started
@EndDate smalldatetime --the date & time the step ended
AS
SET NOCOUNT ON --SET NoCount ON
--DECLARE variables
DECLARE @ErrorNumber int
DECLARE @ErrorRowCount int
DECLARE @JobAuditID int
BEGIN TRANSACTION --INSERT the audit record into the tblAdmin_Audit_Step table
SET @JobAuditID = (SELECT MAX(JobAuditID) FROM tblAdmin_Audit_Jobs)
INSERT INTO tblAdmin_Audit_Step ( JobAuditID, StepNumber, Parameters, NumberRecords,
StartDate, EndDate, Username)
VALUES (@JobAuditID, @StepNumber, @Parameters, @RecordCount,
@StartDate, @EndDate, user_name())
SELECT @ErrorNumber = @@error, @ErrorRowCount = @@RowCount
If @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END
COMMIT TRANSACTION
RETURN (0)
Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 1, 'none', @ErrorNumber, @ErrorRowCount
RETURN (1)
GO
The following stored procedure demonstrates the use of the auditing stored procedure detailed above:

Code Example 19.24
CREATE PROCEDURE usp_AuditSample
AS
SET NOCOUNT ON --SET NoCount ON
--DECLARE variables
DECLARE @ErrorNumber int
DECLARE @RecordCount int
DECLARE @StartDate smalldatetime
DECLARE @EndDate smalldatetime
BEGIN TRANSACTION
SET @StartDate = getdate() --get the datetime the step started
insert into tblTest
select * from tblTest
SELECT @ErrorNumber = @@error, @RecordCount = @@rowcount
If @ErrorNumber <> 0 --error handler
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END
SET @EndDate = getdate() --get the datetime the step finished
--log the audit record into the tblAdmin_Audit_Step table
exec usp_Admin_Audit_Step 1 , 'test from SP', @RecordCount, @StartDate, @EndDate
COMMIT TRANSACTION
RETURN (0)
Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 1, 'none', @ErrorNumber, @RecordCount
RETURN (1)
GO
Code Sample: Error Tracking
The following stored procedures demonstrate one possible method of logging errors in ETL stored procedures. Notice that the stored procedure uses the OBJECT_NAME function to retrieve the name of the object (table, view, stored procedure, and so on). This introduces a level of abstraction, although the code as written is only useful for stored procedures.

Code Example 19.25
CREATE PROCEDURE usp_Admin_Log_Error
@ObjectID int,
@StepNumber int,
@Parameters varchar(50) = 'none',
@ErrorNumber int = 0,
@RecordCount int = 0
AS
--SET NoCount ON
SET NOCOUNT ON
--RETRIEVE the NAME of the object being audited
DECLARE @ObjectName varchar(50)
SET @ObjectName = OBJECT_NAME(@ObjectID)
--INSERT the audit record into the tblAdmin_Audit_Errors table
BEGIN TRANSACTION
insert into tblAdmin_Audit_Errors (Object, StepNumber, Parameters, ErrorNumber,
RecordCount, UserName, RecordDate)
values (@ObjectName, @StepNumber, @Parameters, @ErrorNumber, @RecordCount,
user_name(), getdate())
COMMIT TRANSACTION
GO
Once an error is generated and passed to the error logging stored procedure, it is logged to the tblAdmin_Audit_Errors table. Notice that @@ProcID is used to retrieve the object ID of the executing stored procedure. Also note that the values of @@error and @@rowcount are retrieved immediately after the INSERT statement. With the exception of modifying the value of @Step, this logic may be deployed with no other alterations to the code. The following stored procedure demonstrates how to deploy the error logging method detailed above:

Code Example 19.26
CREATE PROCEDURE usp_ErrorSample
AS
--SET NoCount ON
SET NOCOUNT ON
--DECLARE Variables
DECLARE @ObjectName varchar(50)
DECLARE @ErrorNumber int, @RecordCount int
DECLARE @Step int
--INSERT the audit record into the Authors table
BEGIN TRANSACTION
insert into Authors
Select * from authors
SELECT @ErrorNumber = @@error, @RecordCount = @@rowcount, @Step = 2
If @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END
COMMIT TRANSACTION
RETURN (0)
Err_Handler:
exec usp_Admin_Log_Error @@ProcID, @Step, 'none', @ErrorNumber, @RecordCount
RETURN (1)
GO
Conclusion
The ETL system efficiently extracts data from its sources, transforms and sometimes aggregates data to match the target data warehouse schema, and loads the transformed data into the data warehouse database. A well-designed ETL system supports automated operation that informs operators of errors with the appropriate level of warning. SQL Server 2000 Data Transformation Services can be used to manage the ETL operations, regardless of the techniques used to implement individual ETL tasks.
While it is tempting to perform some transformation on data as it is extracted from the source system, the best practice is to isolate transformations within the transformation modules. In general, the data extraction code should be designed to minimize the impact on the source system databases.
In most applications, the key to efficient transformation is to use a SQL Server 2000 database for staging. Once extracted data has been loaded into a staging database, the powerful SQL Server 2000 database engine is used to perform complex transformations.
The process of loading fact table data from the staging area into the target data warehouse should use bulk load techniques. Dimension table data is usually small in volume, which makes bulk loading less important for dimension table loading.
The ETL system is a primary source of meta data that can be used to track information about the operation and performance of the data warehouse as well as the ETL processes.