Kimball ETL (subsystem 1) – Data Profiling via SSIS Data Flow – Pipeline Column Data Profiling
In a recent article from the Kimball University: The Subsystems of ETL Revisited, published on October 21, 2007 in Intelligent Enterprise, by Bob Becker of the Kimball University states
“Kimball University: The Subsystems of ETL Revisited:These 34 subsystems cover the crucial extract, transform and load architecture components required in almost every dimensional data warehouse environment. Understanding the breadth of requirements is the first step to putting an effective architecture in place.”
In particular the first sub system he describes is Data Profiling.
“Data Profiling (subsystem 1) — Explores a data source to determine its fit for inclusion as a source and the associated cleaning and conforming requirements.”
In this article and several others to follow I will cover the SSIS techniques and available products that can be utilized to support the “The Subsystems of ETL” as defined by the Kimball University via SSIS.
This topic is also cover in Tod McKenna’s, blog.
In the interest of full disclosure, one of the solutions I cover will be from a company I have an interest in Actuality Business Intelligence.
A brief review of the SSIS “Pipeline” Architecture
The primary advantage of the SSIS architecture is the “pipeline”. The objective is for the developer to pass the columns needed for transformation and output through the pipeline and to reduce the staging of columns in tables. Secondly a major feature of “pipeline” processing is that as a developer you can use transforms to split and proactively redirect records and take actions on the fly and in memory, aka “pipeline”.
In order to leverage the “pipeline” a method of analyzing columns and there values is needed. Column analysis is usually accomplished by using data profiling tools, hard coded queries or as provided in SQL 2008 a Data Profiling Control Flow. However these methods produce tables or reports and are not easily available in an SSIS Data Flow.
SSIS Data Profiling using supplied Data Flow Components
I have used a script component in the past and accomplished some of the “pipeline” profiling techniques. See SQLServerrCentral post from May 2007, Data Profiling in SSIS.
In many cases a developer needs to know the characteristics or “profile” of a column and be able to take actions on records (cleaning, rejecting, assign error) in a proactive manner. Meaning that as you process new inputs through your data flow over time you will encounter new errors for specific columns, and will want to ensure the package runs and an identify any new anomalies in records and processes them successfully. In the example I provided last year I demonstrated using the Script Component to identify value patterns (Column Domains) and using the conditional Split Transform route the Valid and Invalid records appropriately. I have included the code below and the entire package is available from Data Profiling in SSIS.
Script Component Column Profiling
Invalid Records Conditional Split
Note the Script Component has identified the specific patterns for the Phone column and some of the invlalid vales are now “tagged” and can be rerouted or acted on in some other way.
Valid Records from Condition Split
The valid records can continue on through the pipeline.
Data Profiling via Script component
‘ Microsoft SQL Server Integration Services user script component
Public Class ScriptMain
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
Row.EmailLength = Len(Row.EmailAddress)
Pattern Summary from Aggregate Transform
In addition we have provided an output for a report summarizing the Script Component Pattern Profiling results.
Reroute records through Condition Split
The goal is to reroute records in the “pipeline” proactively into the appropriate stream. In this case Valid and Invalid Destinations.
However this requires manually coding the required actions and is not easily portable or reusable.
In addition, some of the shortfalls of this solution involve have to modify the Input and Output columns each time you want to process a new input.
So at best our solution via the Script Component is helpful, but a “one off” effort, with much custom work required in order to use theses techniques in other packages.
SSIS Data Profiling using off the shelf Data Flow Component
An obvious solution would for Microsoft to provide discrete data profiling Data Flow Transforms or perhaps an industrious ISV.
A solution now exists, in the Data Flows from Actuality Business Intelligence. I will discuss other vendors in future articles. To date no other vendors have provided in line Data Profiling in the form of a Data Transformation as opposes to a Control Flow.
Now let’s examine implementing the same solution, with custom ABI Data Flow components.
We will use three Data Flow custom components as follows:
ABI Column Profiler –
Calculates profiling statistics for a specific column in the pipeline. Min, Max , Null count etc…
ABI Column Pattern –
Determines the pattern for a specific column in the pipeline, such as 999-99-999 for a US Phone number or 999-99-9999 for a SSN and make is available to downstream transformations.
ABI Column Domain Pattern –
Determines the of distinct values for a specific column and the counts for each vales, as well as, the associated pattern. For SSN:
In this article we will examine the Column Profiler in detail. Additional ABI, components as well as others will be covered in future articles.
SSIS ABI Column Profiler
Once you have installed the ABI SSIS Column Data Profile you can select it from the Toolbox.
SSIS ABI Column Profiler – Column Selection
Once you have connected a Source to the component, a very simple interface is provided to select the columns to profile or pass through.
Available Output Columns are as follows:
The output columns can be routed to any other Transform or Data Flow Destination.
In addition a table, file or xml output can be created to support or maintain a custom metadata repository or create a structure to support a data driven application.
DDL Created from SSIS Column Profiler:
CREATE TABLE [ColumnProfileStats] (
The source code for the Script Component solution can be obtained from Data Profiling in SSIS on JumpStartTV. The custom column profiling component can be downloaded from Actuality Business Intelligence.
Ira W. Whiteside