Kimball ETL (subsystem 1) – Data Profiling via SSIS Data Flow – Pipeline Column Data Profiling

Kimball ETL (subsystem 1) – Data Profiling via SSIS Data Flow – Pipeline Column Data Profiling

In a recent article from the Kimball University: The Subsystems of ETL Revisited, published on October 21, 2007 in Intelligent Enterprise, by Bob Becker of the Kimball University states

“Kimball University: The Subsystems of ETL Revisited:These 34 subsystems cover the crucial extract, transform and load architecture components required in almost every dimensional data warehouse environment. Understanding the breadth of requirements is the first step to putting an effective architecture in place.”

In particular the first sub system he describes is Data Profiling.

“Data Profiling (subsystem 1) — Explores a data source to determine its fit for inclusion as a source and the associated cleaning and conforming requirements.”

In this article and several others to follow I will cover the SSIS techniques and available products that can be utilized to support the “The Subsystems of ETL” as defined by the Kimball University via SSIS.

This topic is also cover in Tod McKenna’s, blog.

In the interest of full disclosure, one of the solutions I cover will be from a company I have an interest in Actuality Business Intelligence.

A brief review of the SSIS “Pipeline” Architecture

The primary advantage of the SSIS architecture is the “pipeline”. The objective is for the developer to pass the columns needed for transformation and output through the pipeline and to reduce the staging of columns in tables. Secondly a major feature of “pipeline” processing is that as a developer you can use transforms to split and proactively redirect records and take actions on the fly and in memory, aka “pipeline”.

In order to leverage the “pipeline” a method of analyzing columns and there values is needed. Column analysis is usually accomplished by using data profiling tools, hard coded queries or as provided in SQL 2008 a Data Profiling Control Flow. However these methods produce tables or reports and are not easily available in an SSIS Data Flow.

SSIS Data Profiling using supplied Data Flow Components

I have used a script component in the past and accomplished some of the “pipeline” profiling techniques. See SQLServerrCentral post from May 2007, Data Profiling in SSIS.

 

Adaptive Cleansing

In many cases a developer needs to know the characteristics or “profile” of a column and be able to take actions on records (cleaning, rejecting, assign error) in a proactive manner. Meaning that as you process new inputs through your data flow over time you will encounter new errors for specific columns, and will want to ensure the package runs and an identify any new anomalies in records and processes them successfully. In the example I provided last year I demonstrated using the Script Component to identify value patterns (Column Domains) and using the conditional Split Transform route the Valid and Invalid records appropriately. I have included the code below and the entire package is available from Data Profiling in SSIS.

 

Script Component Column Profiling

image001

 

Invalid Records Conditional Split

image002 

Note the Script Component has identified the specific patterns for the Phone column and some of the invlalid vales are now “tagged” and can be rerouted or acted on in some other way.

 

Valid Records from Condition Split 

image003

 

The valid records can continue on through the pipeline.

 

Data Profiling via Script component

‘ Microsoft SQL Server Integration Services user script component
‘ This is your new script component in Microsoft Visual Basic .NET
‘ ScriptMain is the entrypoint class for script components

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent
    Dim patAlphaUpper As String = "[^a-z0-9 ,@%&/#’.-]"
    Dim patAlphaLower As String = "[^A-Z0-9 ,@%&/#’.-]"
    Dim patNum As String = "[^A-Za-z ,@%&/#’.-]"
    Dim patSpecial As String = "[^A-Za-z0-9]"
    Dim patEmail As String = "b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}b"
    Dim patUSPhone As String = "^(?:(?<1>[(])?(?<AreaCode>[1-9]d{2})(?(1)[)])(?(1)(?<2>[ ])|(?:(?<3>[-])|(?<4>[ ])))?)?(?<Prefix>[1-9]d{2})(?(AreaCode)(?:(?(1)(?(2)[- ]|[-]?))|(?(3)[-])|(?(4)[- ]))|[- ]?)(?<Suffix>d{4})$"
    Dim rgxAlphaUpper As New Text.RegularExpressions.Regex(patAlphaUpper)
    Dim rgxAlphalower As New Text.RegularExpressions.Regex(patAlphaLower)
    Dim rgxSpecial As New Text.RegularExpressions.Regex(patSpecial)
    Dim rgxNum As New Text.RegularExpressions.Regex(patNum)
    Dim rgxEmail As New Text.RegularExpressions.Regex(patEmail)
    Dim rgxPhone As New Text.RegularExpressions.Regex(patUSPhone)
    Dim tagAlphaUpper As String
    Dim tagAlphaLower As String
    Dim tagNumber As String

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ‘
        ‘ Add your code here
        ‘

        Row.EmailLength = Len(Row.EmailAddress)
        tagAlphaUpper = rgxAlphaUpper.Replace(Row.EmailAddress, "A")
        tagAlphaLower = rgxAlphalower.Replace(tagAlphaUpper, "a")
        Row.EmailPattern = rgxNum.Replace(tagAlphaLower, "9")
        Row.EmailFlag = rgxEmail.Match(Row.EmailAddress).Success
        Row.PhoneLength = Len(Row.Phone)
        Row.PhoneFlag = rgxPhone.Match(Row.Phone).Success
        tagAlphaUpper = rgxAlphaUpper.Replace(Row.Phone, "A")
        tagAlphaLower = rgxAlphalower.Replace(tagAlphaUpper, "a")
        Row.PhonePattern = rgxNum.Replace(tagAlphaLower, "9")
        Row.Count = 1

    End Sub

End Class

 

 

Pattern Summary from Aggregate Transform

 

image004

In addition we have provided an output for a report summarizing the Script Component Pattern Profiling results.

Reroute records through Condition Split

image005

The goal is to reroute records in the “pipeline” proactively into the appropriate stream. In this case Valid and Invalid Destinations.

However this requires manually coding the required actions and is not easily portable or reusable.

In addition, some of the shortfalls of this solution involve have to modify the Input and Output columns each time you want to process a new input.

So at best our solution via the Script Component is helpful, but a “one off” effort, with much custom work required in order to use theses techniques in other packages.

SSIS Data Profiling using off the shelf Data Flow Component

An obvious solution would for Microsoft to provide discrete data profiling Data Flow Transforms or perhaps an industrious ISV.

A solution now exists, in the Data Flows from Actuality Business Intelligence. I will discuss other vendors in future articles. To date no other vendors have provided in line Data Profiling in the form of a Data Transformation as opposes to a Control Flow.

Now let’s examine implementing the same solution, with custom ABI Data Flow components.

image

 

We will use three Data Flow custom components as follows:

ABI Column Profiler –

Calculates profiling statistics for a specific column in the pipeline. Min, Max , Null count etc…

ABI Column Pattern –

Determines the pattern for a specific column in the pipeline, such as 999-99-999 for a US Phone number or 999-99-9999 for a SSN and make is available to downstream transformations.

ABI Column Domain Pattern –

Determines the of distinct values for a specific column and the counts for each vales, as well as, the associated pattern. For SSN:

 

Column Name Pattern Value
SSN 999-99-9999 123-45-6789
SSN 999-99-9999 123-45-6790
SSN 999-99-999z 123-45-890i

 

In this article we will examine the Column Profiler in detail. Additional ABI, components as well as others will be covered in future articles.

 

SSIS ABI Column Profiler

image

Once you  have installed the ABI SSIS Column Data Profile you can select it from the Toolbox.

SSIS ABI Column Profiler – Column Selection

image

Once you have connected a Source to the component, a very simple interface is provided to select the columns to profile or pass through.

Available Output Columns are as follows:

 image

The output columns can be routed to any other Transform or Data Flow Destination.

 

image

In addition a table, file or xml output can be created to support or maintain a custom metadata repository or create a structure to support a data driven application.

DDL Created from SSIS Column Profiler: 

CREATE TABLE [ColumnProfileStats] (

[SortKeyPosition] SMALLINT,

[ComparisonFlag] SMALLINT, 

[Precision] SMALLINT,

[Length] SMALLINT,

[CodePage] SMALLINT,

[InferredDataType] VARCHAR(15),

[SQLDataType] VARCHAR(20),

[BlankDomainRatio] REAL,

[BlankDomainFrequency] REAL,

[NullDomainRatio] REAL,

[NullDomainFrequency] REAL,

[MaxDomainValue] VARCHAR(2000),

[MinDomainValue] VARCHAR(100),

[MostPopularDomainRatio] REAL,

[MostPopularDomainFrequency] REAL,

[MostPopularDomainValue] VARCHAR(2000),

[DistinctDomainRatio] REAL,

[DistinctDomainFrequency] REAL,

[TotalRecordCount] REAL,

[ColumnName] VARCHAR(255)

)

The source code for the Script Component solution can be obtained from Data Profiling in SSIS on JumpStartTV. The custom column profiling component can be downloaded from Actuality Business Intelligence.

 

 

image

 

image

Ira W. Whiteside

Actuality Business Intelligence LLC

Advertisements

About irawarrenwhiteside

BI Architect
This entry was posted in Uncategorized. Bookmark the permalink.

3 Responses to Kimball ETL (subsystem 1) – Data Profiling via SSIS Data Flow – Pipeline Column Data Profiling

  1. Pingback: Data Profiling via SSIS – Link « Tech Hub

  2. Տuperbe article, pérennise dans cettе voie

  3. Un article rempli de superbes conseils

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s