Master Data Services – SSIS MDM Load

Microsoft recently released Master Data Services in the SQL Server 2008 R2 CTP. Without getting into the strategic benefits, definitions, etc. of MDM (Master Data Management), this tool will be very helpful for any application in automating the setup and maintenance of lookup or cross-reference tables, or in utilizing reference data.

 

  

Kirk Haselden has provided a white paper on The What, Why, and How of Master Data Management

 

 

 

Nick Barclay (BI-Lingual) has provided an excellent post on getting started with this tool. In his post he provides a walkthrough on how to set up a sample Geography Model and the associated Entities, Attributes and Hierarchies. In addition he provides the TSQL to load them. I have expanded on his work and provided a basic SSIS package for loading the Model.

 

 

 

 

 MDS_StagingLoadBarclayProject.dtsx

The approach was to use Derived Column transforms to set up the required metadata and an Aggregate transform to generate distinct results. Obviously Nick has done all the heavy lifting; I thought it would be helpful to have an SSIS package to complement his very informative post.
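To make the data-flow logic concrete, here is a rough T-SQL equivalent of what the Derived Column and Aggregate transforms produce, shown only to illustrate the shape of the staged rows; the actual staging table and column names come from Nick Barclay’s post, and the model, entity and output names below are placeholders, not his exact values.

-- Illustrative only: distinct members plus constant metadata columns,
-- mimicking the Aggregate (distinct) and Derived Column transforms.
SELECT DISTINCT
       sp.[Name]            AS MemberName,     -- placeholder output names
       sp.StateProvinceCode AS MemberCode,
       'Geography'          AS ModelName,      -- constants added via Derived Column
       'State Province'     AS EntityName
FROM AdventureWorks.Person.StateProvince AS sp;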

 

 

 

Lastly, I intentionally kept this package basic; there are many opportunities for improvement, such as making the entire process data driven, which I will explore next.

 

 

 

An additional reference for understanding MDM is David Loshin’s book.

 

Ira Warren Whiteside

Actuality Business Intelligence

 

 

"karo yaa na karo, koshish jaisa kuch nahi hai"

"Do, or do not. There is no try."

 

Perception is Perception “Awareness is Reality”

The great philosophical lesson I think I have learned is that “Perception is NOT Reality”. I have always used the phrase “Perception is Reality” and referred to Plato’s Allegory of the Cave, smugly interpreting it for people unfamiliar with it. It was not until the fear of losing my soul mate, my wife, my love Tessie, that I realized the true meaning. I would always relate it to the common conundrum of a person trying to understand why they were misunderstood or not appreciated, and what they could do to change the reality or perception of others. I have now been enlightened. It is now obvious to me that I am an idiot. Mine and Tessie’s Perception/Reality was that her mom’s cancer and death was from second-hand smoke, because that is what the doctors told her, so if she didn’t smoke and basically ate right, and her sisters didn’t have cancer, she was safe. Our reality was DNA. Perception is what a single person chooses to believe and base their decisions on, and it has nothing to do with Reality. I am an idiot. Everything I have learned about cancer in the last few months has been there for years; my knowledge has completely changed my Perception and has not altered our Reality, it has increased our awareness. I have read the allegory hundreds of times and only now have paid attention to one of the last paragraphs.

“And now consider what would happen if such a man were to descend again and seat himself on his old seat? Coming so suddenly out of the sun, would he not find his eyes blinded with the gloom of the place?

   Certainly, he would.

   And if he were forced to deliver his opinion again, touching the shadows aforesaid, and to enter the lists against those who had always been prisoners, while his sight continued dim and his eyes unsteady, – and if this process of initiation lasted a considerable time, – would he not be made a laughingstock, and would it not be said of him, that he had gone up only to come back again with his eyesight destroyed, and that it was not worth while even to attempt the ascent? And if anyone endeavored to set them free and carry them to the light, would they not go so far as to put him to death, if they could only manage to get him into their power?”

I always focused on the fact that the prisoner became enlightened and gained awareness and a different, and I thought better, Perception and changed his Reality, and I was very pleased with myself for comprehending this. The revelation to me now is that the experience (awareness) that changed his Perception of his Perception did not change his Reality; it simply made him aware of his incorrect awarement and led to a quicker death. Apparently Perception is Perception and changes frequently, based on your current awareness, while Reality is constant and unchanging. How can it change, since as soon as you think about it, it’s gone? Changing your Perception can prolong your ability to experience (gain knowledge/awareness of) Reality; it does not change Reality. I have been pondering lately if “Awareness is Reality”.

In any event I have endeavored to increase our awareness of awarement regarding cancer and prolong our ability to experience reality.

Providing Operational Analytics – Microsoft Analysis Services / Informatica / QlikView

 

Real Time Update Scenarios

Ira Warren Whiteside

Recently a client has described a need for real time (or near-real time) movement of data from their operational logistics application to an analytics/reporting visualization environment where presentation tools such as Microsoft Reporting Services, Business Objects or QlikView consume the data. First we will focus on the “Real Time” update scenarios. We can address presentation in a separate document. Based on the discussion we noted three possible approaches to solving this problem.

1. Replicate Tables using SQL Server Replication and use Business Objects directly against the OLTP.

2. Utilize a real time update process (Informatica/SQL Server) to incrementally update tables and use Business Objects directly against the OLTP.

3. Use an OLAP or memory-based approach, using various compression techniques, eliminating the need for additional relational tables and SQL-based query tools.

In our discussion several vendors were mentioned:

1. Informatica via Informatica PowerExchange™: MS SQL Server Real-Time Option, for updating relational tables. This is a change data capture tool.

2. QlikView – a memory-based reporting and querying tool.

3. SQL Server 2005 – Microsoft DBMS, OLAP, and associated applications (SSIS – SQL Server Integration Services, SSAS – SQL Server Analysis Services).

There are other solutions that are not relationally based, such as Netezza or Sand Technology. I would suggest a brief review of Gartner’s current analysis as it relates to real-time updates.

I would like to add the observation that the transfer problem is a separate logical problem from the analysis visualization problem. The three tools discussed all approach the problem completely differently. Only QlikView attempts to present a solution that combines the two.

There are two approaches to near-real time analytics. They are as follows:

  • Operational Analysis is based on loading the source data into compressed, proprietary data stores without transformations (cleansing, normalization, hierarchies). This provides analytical capabilities, including the metrics, relationships and hierarchies that exist in the data model being sourced.
    • The limitations here are no trending, no history, and no support for complex business rules or metrics that require corporate reference data such as product or organization hierarchies.
    • A benefit in relation to the extraction process is that this approach does not require any staging area.
  • Traditional Business Intelligence is based on a metadata-driven approach where the source data is transformed to properly analyze a specific set of business metrics and their associated business process hierarchies, including trending and history.
    • A limitation is that this approach requires a more complex extraction and loading process and a staging area.
    • A major benefit is that this approach will “insulate” the reporting or analytical layer from any changes or additions to the source. This is accomplished by using a data-driven approach and creating a business-oriented dimensional semantic layer. In most cases the metrics and dimensions match the business processes and do not change over time, whereas the source data and nomenclature are volatile.

I conducted a brief analysis of the vendors mentioned and have the following observations:

  • Informatica – Their “Real Time” offering for Microsoft relies on SQL Server 2005 Replication and Publishing and concentrates on providing CDC (Change Data Capture) technology for applications without date/time stamps or update switches.

a. This would require replicating the existing database model and reporting directly against it, or developing a dimensional model.

clip_image002

  • QlikView – This is a memory-based tool requiring various QlikView servers for processing and publishing the QlikView documents via the QlikView Publisher. QlikView has a wizard to create load and transformation scripts. It assumes relationships based on identically named fields. In most cases, per the vendor’s recommendation, this needs to be manually modified.

clip_image004

  • SQL Server 2005 Analysis Services is similar to QlikView in architecture. The specific “real time” features are Snapshot Isolation and Proactive Caching. Snapshot Isolation ensures the data extracted is kept in sync by only pulling data available when the first table is queried. By defining caching policies, Analysis Services provides high query performance by maintaining a cache of the underlying data in a form optimized for analytical queries. The new Analysis Services 2005 proactive caching feature provides a way for the cube designer to define policies that balance the business need for performance against the need for real-time data. The system runs on “auto pilot,” without the need for explicit processing.

Different policies allow for quite different behaviors. For example:

      • Automatic MOLAP. Real-time data.
      • Scheduled MOLAP. The cube is simply refreshed periodically, irrespective of updates to the source data.
      • Incremental MOLAP. The cube is refreshed incrementally with new data in the source database.
      • Real-time ROLAP. The cube is always in ROLAP mode, and all updates to the source data are immediately reflected in the query results.

The proactive caching settings can be set on each different Analysis Services object (dimension and partition). This provides great flexibility in cases where the policies and the tradeoffs between performance and latency vary for different parts of the cubes.

clip_image006

Microsoft cubes, in conjunction with the SSIS pipeline architecture and the UDM (Unified Dimensional Model), do not require any staging.


Recommendations: I would recommend a POC (Proof of Concept) and the creation of two scenarios, focusing on the two OLAP-type options using QlikView and SQL Server 2005 Analysis Services. As part of the scenarios I suggest:

  • Validating existing business requirements including metric formulas and required hierarchies
  • Defining the metadata driven Extraction Architecture (independently of tool).

In reviewing the two tools it is apparent the extraction processes would be similar and SQL based, but would rely on different scheduling processes. Also, it would seem that using views against the source database would be rejected out of hand due to performance. A best-practices Extraction Architecture would require a metadata-driven application architecture that is aware of structure or content (domain) changes and automatically adjusts load processes accordingly. Each solution would probably require replication. The focus of completing the scenarios would be to evaluate the gaps each of the recommended scenarios/tools has against the Extraction Architecture requirements, and the effort and skill set required to complete the POC.

  • Scalability
    • For QlikView you would need to manually design your queries. In their documentation they mention loading 4 million rows per minute; without knowing the number of columns or the hardware configuration this is not very informative.
    • MSAS, using the UDM and distributed query capability, is capable of 120,000 rows per second, or roughly 7 million rows per minute, on a dual-core Xeon (EM64T). [1]
  • Testing Approach – I would require full-volume testing based on several documented update scenarios and the full spectrum of user queries, including testing of dynamic queries, to determine whether you will need joins in QlikView or hierarchies in MSAS.
  • Source DB impact – We would need to inform and involve the Source DB team in our Extract Architecture and work toward direct access to their tables, most likely in a replicated environment. Each tool (MSAS, QlikView) recommends extracting only the necessary attributes and rows and requires manually adjusting same-named fields and joins. [2]
  • Team Structure – I would structure the team around the Extraction Architecture with individuals focusing on Source Profiling, Source Extraction, Load Scripts/Design and finally presentation.

[1] Microsoft SQL Server 2005 Analysis Services Performance Guide

[2] Actually, QlikView requires significant manual coding for loading. Specifically, QlikView assumes all like-named fields are keys and attempts to create joins; these must be hand coded to include only the required fields (see FAQ 7 in their documentation).

Roll your own Data Profiling – Column Functional Dependency

By Ira Warren Whiteside

 

clip_image001[1]

Actuality Business Intelligence

 

 

A while back I was reading one of Jamie Thomson’s most excellent posts on the Conchango SSIS Junkie blog, SSIS: Data Profiling Task: Part 7 – Functional Dependency, and it occurred to me that it would be interesting to provide a poor man’s Functional Dependency check using TSQL.

 

First let’s define Functional Dependency.

Functional Dependency

Determines the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column); this profile also helps you identify problems in your data, such as values that are not valid. For example, you profile the dependency between a column that contains States/Provinces and a column that contains Country or Region Codes. The same State/Province should always have the same Country or Region Code, but the profile discovers violations of this dependency.
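In other words, a dependency holds when each determinant value maps to exactly one dependent value. Before building the row-linking version below, here is a minimal sketch of that idea as a direct set-based check; it assumes the Addresses test table that is created later in this post, where StateProvinceName should determine CountryRegionCode:

-- Minimal sketch: flag determinant values that map to more than one dependent value.
select StateProvinceName,
       count(distinct CountryRegionCode) as DistinctCountryCodes
from Addresses
group by StateProvinceName
having count(distinct CountryRegionCode) > 1

Any row returned is a state that maps to more than one country code, i.e. a violation of the dependency.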

 

Now let’s take a crack at writing our own. Naturally, when facing a difficult problem in TSQL, I turned to SQL Server Central’s vast library.

 

When analyzing the capabilities needed to determine one column’s dependency on another, you obviously need recursion or the ability to link one row to the next. A brief search of existing articles on SQL Server Central returned Linking to the Previous Row by David McKinney, 2008/03/13, and Common Table Expressions in SQL Server 2005 by Srinivas Sampath, 2007/03/09.

 

Armed with these two articles I now had all the techniques I needed to develop a query or stored procedure to provide Column Functional Dependency.

 

TSQL Functional Dependency

The heart of this effort is to utilize Common Table Expressions (CTEs), or the “with” clause, defined on MSDN as:

 

“Specifies a temporary named result set, known as a common table expression (CTE). This is derived from a simple query and defined within the execution scope of a single SELECT, INSERT, UPDATE, or DELETE statement. This clause can also be used in a CREATE VIEW statement as part of its defining SELECT statement. A common table expression can include references to itself. This is referred to as a recursive common table expression.”

 

Basically, the “with” clause provides you with the ability to logically create CTEs (named result sets), develop your code in logical steps, and employ recursive techniques.
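As a minimal illustration of the syntax (not part of the dependency solution itself), here is a recursive CTE that generates the numbers 1 through 5:

-- Minimal sketch of the "with" clause: a CTE that references itself.
;WITH Numbers (n) AS
(
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM Numbers WHERE n < 5
)
SELECT n FROM Numbers;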

 

A detailed treatment of CTEs is available in Common Table Expressions in SQL Server 2005 by Srinivas Sampath, 2007/03/09.

 

Setup Test Data

I have used the same test setup that Jamie Thomson provides for testing the SQL Server 2008 Data Profiling Task.

 

 

select ROW_NUMBER() OVER (partition by sp.Name order by a.PostalCode) RowNum
, a.PostalCode
, sp.Name AS StateProvinceName
, sp.CountryRegionCode
into Addresses
from Person.[Address] a
inner join Person.StateProvince sp
on a.StateProvinceID = sp.StateProvinceID

update Addresses
set CountryRegionCode = 'ZZ'
where RowNum = 1

 

 

 

This will create a test table called Addresses, which contains a Functional Dependency scenario for States/Provinces and Countries.
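As a quick sanity check, the injected violations can be listed directly; each state now has one row tagged with the dummy ‘ZZ’ country code:

-- List the rows that break the StateProvinceName -> CountryRegionCode dependency.
select StateProvinceName, CountryRegionCode, RowNum
from Addresses
where CountryRegionCode = 'ZZ'
order by StateProvinceName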

 

 

clip_image003

 

 

 

 

 

Example Functional Dependency Stored Procedure

 

 

USE [AdventureWorks]
GO
/****** Object: Table [dbo].[DependencyColumnTable] Script Date: 01/14/2009 19:07:54 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[DependencyColumnTable]') AND type in (N'U'))
DROP TABLE [dbo].[DependencyColumnTable]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
USE [AdventureWorks]
GO
/****** Object: StoredProcedure [dbo].[ColumnDependency] Script Date: 01/14/2009 17:33:24 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependency]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[ColumnDependency]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[ColumnDependency]
@TableName VARCHAR(128) ,
@Determinant VARCHAR(128) ,
@Dependent VARCHAR(128)
AS
Declare @SQL VarChar(MAX)
SELECT @SQL = ' ;WITH DependencyCompare AS ('
SELECT @SQL = @SQL + ' SELECT '
SELECT @SQL = @SQL + @Determinant + ' ,' + @Dependent + ', '
SELECT @SQL = @SQL + 'ROW_NUMBER() OVER (Partition BY ' + @Dependent + ' ORDER BY ' + @Determinant + ' ) AS rownumcalc FROM '
SELECT @SQL = @SQL + @TableName
SELECT @SQL = @SQL + ' ) SELECT
currow.' + @Dependent + ' AS curDet,
prevrow.' + @Dependent + ' AS prevDet,
nextrow.' + @Dependent + ' AS nextDet,
currow.' + @Determinant + ' AS curDep,
prevrow.' + @Determinant + ' AS prevDep,
nextrow.' + @Determinant
+ ' AS nextDep,
currow.rownumcalc ,
ViolationCount =
Case
When currow.' + @Determinant + ' = prevrow.' + @Determinant +
' and currow.' + @Dependent + ' = prevrow.' + @Dependent +
' then 0
When prevrow.' + @Dependent + ' is null
then 0
Else 1
End'
+
' INTO DependencyColumnTable
FROM
DependencyCompare currow '
+ ' LEFT JOIN DependencyCompare nextrow ON currow.rownumcalc = nextrow.rownumcalc - 1 AND currow.'
+ @Dependent
+ ' = nextrow.'
+ @Dependent
+ ' LEFT JOIN DependencyCompare prevrow ON currow.rownumcalc = prevrow.rownumcalc + 1 AND currow.'
+ @Dependent
+ ' = prevrow.'
+ @Dependent
+ ' Order by currow.'
+ @Dependent
Print @SQL
Exec (@SQL)
GO

Exec ColumnDependency
@TableName = 'Addresses',
@Dependent = 'StateProvinceName',
@Determinant = 'CountryRegionCode'

 

 

The code here is basically leveraging a CTE together with ROW_NUMBER to link each row to the previous row. Our objective is to compare each current row to the previous row’s columns to determine when a Dependent column value changes while the Determinant column has NOT changed; this is a Violation. A detailed explanation of the technique of linking to previous records can be reviewed at SQL Server Central in Linking to the Previous Row by David McKinney, 2008/03/13. One issue I had was needing to dynamically populate the FROM clause with the table name.
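For reference, this is roughly the statement the procedure generates (and echoes via Print @SQL) for the Addresses example; it is a hand-written approximation with the nextrow join omitted for brevity:

;WITH DependencyCompare AS (
    SELECT CountryRegionCode, StateProvinceName,
           ROW_NUMBER() OVER (PARTITION BY StateProvinceName
                              ORDER BY CountryRegionCode) AS rownumcalc
    FROM Addresses
)
SELECT currow.StateProvinceName  AS curDet,
       prevrow.StateProvinceName AS prevDet,
       currow.CountryRegionCode  AS curDep,
       prevrow.CountryRegionCode AS prevDep,
       currow.rownumcalc,
       ViolationCount = CASE
            WHEN currow.CountryRegionCode = prevrow.CountryRegionCode
             AND currow.StateProvinceName = prevrow.StateProvinceName THEN 0
            WHEN prevrow.StateProvinceName IS NULL THEN 0
            ELSE 1
       END
INTO DependencyColumnTable
FROM DependencyCompare currow
LEFT JOIN DependencyCompare prevrow
       ON currow.rownumcalc = prevrow.rownumcalc + 1
      AND currow.StateProvinceName = prevrow.StateProvinceName
ORDER BY currow.StateProvinceName;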

 

I used the technique presented in Introduction to Dynamic SQL (Part 1) by Damian Maclennen, 20 June 2001. This technique relies on creating a @SQL variable and executing that variable as part of the stored procedure.

 

 

 

 

Create Procedure GenericTableSelect
@TableName VarChar(100)
AS
Declare @SQL VarChar(1000)
SELECT @SQL = 'SELECT * FROM '
SELECT @SQL = @SQL + @TableName
Exec ( @SQL)
GO

 

 

Example Functional Dependency Results

 

 

clip_image005

 

 

 

As you can see here, Alabama is our Determinant value, which normally has an associated Dependent value of US. When the Dependent value changes to ZZ, the row is tagged with a ViolationCount of 1.

 

 

Example Functional Summary

 

 

/* This SP will create a Summary Table and calculate the Support and Violation percentages.
Sample Execution:
exec ColumnDependencyOLAP
*/

USE [AdventureWorks]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
USE [AdventureWorks]
GO
/****** Object: StoredProcedure [dbo].[ColumnDependencyOLAP] Script Date: 01/14/2009 17:33:24 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependencyOLAP]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[ColumnDependencyOLAP]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[ColumnDependencyOLAP]
AS
Declare @SQL VarChar(MAX)
SELECT @SQL = '
;WITH DependencyCompareDetail AS
(
Select count(ViolationCount) as Violations , curDet , curDep,
Case when ViolationCount = 0
then ''Support Percentage''
else ''Violation Percentage''
end as Violation ,
ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet ,curDep) AS rownumcalc
FROM [AdventureWorks].[dbo].[DependencyColumnTable]
--Where ViolationCountNull = ''N''
group by curDet, curDep , ViolationCount
)
,DependencyCompareSummary AS (
Select count(ViolationCount) as total , curDet ,
ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet ) AS rownumcalc
FROM [AdventureWorks].[dbo].[DependencyColumnTable]
--Where ViolationCountNull = ''N''
group by curDet
)
,DependencyCompareOLAP AS
(
Select
DependencyCompareDetail.curDet ,
DependencyCompareDetail.curDep,
DependencyCompareDetail.Violation,
DependencyCompareDetail.Violations ,
DependencyCompareSummary.total ,
ROW_NUMBER() OVER (Partition BY DependencyCompareDetail.curDet ORDER BY DependencyCompareDetail.curDep) AS rownumcalc
From DependencyCompareDetail
LEFT JOIN DependencyCompareSummary
ON DependencyCompareDetail.curDet = DependencyCompareSummary.curDet
)
Select
curDet as Determinant,
curDep as Dependent,
Violation as ''Dependency Type'',
Violations as ''Dependency Counts'' ,
total as ''Total Determinant Count'',
cast(cast(Violations AS Decimal ) /cast(total AS Decimal )* 100 as decimal (5, 2)) as PercentViolation
INTO DependencyCompareOLAP
from DependencyCompareOLAP
Order by curDet , curDep '
Print @SQL
Exec (@SQL)
GO

 

 

Now that we have created a table with the Functional Dependency Violations identified, we want to create a table that can be used for Analysis or Reporting, containing the same percentages available in the SSIS 2008 Data Profiling Task Functional Dependency option. Again we have relied on CTEs in order to create a reusable query that builds a single table with what is required for analysis, such as the Total Determinant count and the Dependents categorized as Supporting Dependents and Violating Dependents.
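Once the summary table exists, a simple follow-on query (shown here only as an example against the DependencyCompareOLAP table created above) surfaces the worst offenders:

-- Determinant values with the highest violation percentages.
select Determinant, Dependent, [Dependency Type], PercentViolation
from DependencyCompareOLAP
where [Dependency Type] = 'Violation Percentage'
order by PercentViolation desc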

 

 

Example Functional Summary Results

 

 

 

clip_image007

 

Here are the results from the SSIS 2008 Data Profiling Task for Functional Dependency.

 

clip_image009

 

As you can see the Support Percentages for Alabama are exactly the same.

 

Roll your Own Interactive Scorecard

 

clip_image011

 

I would also add that from an analytical perspective you can quickly create a Pivot Table in Microsoft Excel 2007. Simply open Excel 2007 and select the Data tab.

Then select From Other Sources, follow the wizard, and create a connection to the DependencyCompareOLAP table to create the Pivot table and Chart below. As you can see, this is a quick way to identify the outliers like Alabama.

 

clip_image013

 

Or Manitoba in Canada

 

clip_image015

 

Further Enhancements

 

You can also add additional columns to the table and stored procedures, such as the table name and the date and time of the run, and create a repository that would allow you to track the Violations incrementally over time and provide proactive processes to respond to continuing anomalies.
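One possible shape for such a repository table, purely as a sketch (the column names here are illustrative and not part of the procedures above):

-- Illustrative only: a history table for tracking dependency violations over time.
CREATE TABLE dbo.DependencyViolationHistory (
    AuditID           INT IDENTITY(1,1) PRIMARY KEY,
    TableName         VARCHAR(128),
    DeterminantColumn VARCHAR(128),
    DependentColumn   VARCHAR(128),
    Determinant       VARCHAR(255),
    Dependent         VARCHAR(255),
    PercentViolation  DECIMAL(5,2),
    ProfiledAt        DATETIME DEFAULT GETDATE()
)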

 

Here are several additional links that were helpful.

 

1. Introduction to Dynamic SQL (Part 1) by Damian Maclennen, 20 June 2001

2. SQL Server 2005 Recursion and WITH Clause by Robbe D. Morris

3. The SQL WITH Clause (Order Your Thoughts, Reuse Your Code) by Kevin Meade, 2007-06-19

Kimball ETL (subsystem 1) – Data Profiling via SSIS Data Flow – Pipeline Column Data Profiling

In a recent article, Kimball University: The Subsystems of ETL Revisited, published on October 21, 2007 in Intelligent Enterprise, Bob Becker of the Kimball University states:

“Kimball University: The Subsystems of ETL Revisited: These 34 subsystems cover the crucial extract, transform and load architecture components required in almost every dimensional data warehouse environment. Understanding the breadth of requirements is the first step to putting an effective architecture in place.”

In particular the first sub system he describes is Data Profiling.

“Data Profiling (subsystem 1) — Explores a data source to determine its fit for inclusion as a source and the associated cleaning and conforming requirements.”

In this article and several others to follow, I will cover the SSIS techniques and available products that can be utilized to support “The Subsystems of ETL” as defined by the Kimball University.

This topic is also covered in Tod McKenna’s blog.

In the interest of full disclosure, one of the solutions I cover will be from a company I have an interest in, Actuality Business Intelligence.

A brief review of the SSIS “Pipeline” Architecture

The primary advantage of the SSIS architecture is the “pipeline”. The objective is for the developer to pass the columns needed for transformation and output through the pipeline and to reduce the staging of columns in tables. Secondly a major feature of “pipeline” processing is that as a developer you can use transforms to split and proactively redirect records and take actions on the fly and in memory, aka “pipeline”.

In order to leverage the “pipeline”, a method of analyzing columns and their values is needed. Column analysis is usually accomplished by using data profiling tools, hard-coded queries or, as provided in SQL Server 2008, the Data Profiling task in the control flow. However, these methods produce tables or reports and are not easily available in an SSIS Data Flow.

SSIS Data Profiling using supplied Data Flow Components

I have used a script component in the past and accomplished some of the “pipeline” profiling techniques. See my SQLServerCentral post from May 2007, Data Profiling in SSIS.

 

Adaptive Cleansing

In many cases a developer needs to know the characteristics or “profile” of a column and be able to take actions on records (cleansing, rejecting, assigning errors) in a proactive manner, meaning that as you process new inputs through your data flow over time you will encounter new errors for specific columns, and will want to ensure the package runs, identifies any new anomalies in records, and processes them successfully. In the example I provided last year I demonstrated using the Script Component to identify value patterns (Column Domains) and using the Conditional Split transform to route the Valid and Invalid records appropriately. I have included the code below, and the entire package is available from Data Profiling in SSIS.

 

Script Component Column Profiling

image001

 

Invalid Records Conditional Split

image002 

Note the Script Component has identified the specific patterns for the Phone column, and some of the invalid values are now “tagged” and can be rerouted or acted on in some other way.

 

Valid Records from Condition Split 

image003

 

The valid records can continue on through the pipeline.

 

Data Profiling via Script component

' Microsoft SQL Server Integration Services user script component
' This is your new script component in Microsoft Visual Basic .NET
' ScriptMain is the entrypoint class for script components

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent
    ' Character-class patterns used to reduce each value to a generalized pattern
    ' (upper-case letters become "A", lower-case letters become "a", digits become "9").
    Dim patAlphaUpper As String = "[^a-z0-9 ,@%&/#'.-]"
    Dim patAlphaLower As String = "[^A-Z0-9 ,@%&/#'.-]"
    Dim patNum As String = "[^A-Za-z ,@%&/#'.-]"
    Dim patSpecial As String = "[^A-Za-z0-9]"
    Dim patEmail As String = "\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b"
    Dim patUSPhone As String = "^(?:(?<1>[(])?(?<AreaCode>[1-9]\d{2})(?(1)[)])(?(1)(?<2>[ ])|(?:(?<3>[-])|(?<4>[ ])))?)?(?<Prefix>[1-9]\d{2})(?(AreaCode)(?:(?(1)(?(2)[- ]|[-]?))|(?(3)[-])|(?(4)[- ]))|[- ]?)(?<Suffix>\d{4})$"
    Dim rgxAlphaUpper As New Text.RegularExpressions.Regex(patAlphaUpper)
    Dim rgxAlphalower As New Text.RegularExpressions.Regex(patAlphaLower)
    Dim rgxSpecial As New Text.RegularExpressions.Regex(patSpecial)
    Dim rgxNum As New Text.RegularExpressions.Regex(patNum)
    Dim rgxEmail As New Text.RegularExpressions.Regex(patEmail)
    Dim rgxPhone As New Text.RegularExpressions.Regex(patUSPhone)
    Dim tagAlphaUpper As String
    Dim tagAlphaLower As String
    Dim tagNumber As String

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Profile the EmailAddress column: length, generalized pattern, and a validity flag.
        Row.EmailLength = Len(Row.EmailAddress)
        tagAlphaUpper = rgxAlphaUpper.Replace(Row.EmailAddress, "A")
        tagAlphaLower = rgxAlphalower.Replace(tagAlphaUpper, "a")
        Row.EmailPattern = rgxNum.Replace(tagAlphaLower, "9")
        Row.EmailFlag = rgxEmail.Match(Row.EmailAddress).Success
        ' Profile the Phone column in the same way.
        Row.PhoneLength = Len(Row.Phone)
        Row.PhoneFlag = rgxPhone.Match(Row.Phone).Success
        tagAlphaUpper = rgxAlphaUpper.Replace(Row.Phone, "A")
        tagAlphaLower = rgxAlphalower.Replace(tagAlphaUpper, "a")
        Row.PhonePattern = rgxNum.Replace(tagAlphaLower, "9")
        Row.Count = 1

    End Sub

End Class

 

 

Pattern Summary from Aggregate Transform

 

image004

In addition we have provided an output for a report summarizing the Script Component Pattern Profiling results.

Reroute records through Condition Split

image005

The goal is to reroute records in the “pipeline” proactively into the appropriate stream. In this case Valid and Invalid Destinations.

However this requires manually coding the required actions and is not easily portable or reusable.

In addition, some of the shortfalls of this solution involve having to modify the Input and Output columns each time you want to process a new input.

So at best our solution via the Script Component is helpful, but it is a “one off” effort, with much custom work required in order to use these techniques in other packages.

SSIS Data Profiling using off the shelf Data Flow Component

An obvious solution would be for Microsoft, or perhaps an industrious ISV, to provide discrete data profiling Data Flow transforms.

A solution now exists in the Data Flow components from Actuality Business Intelligence. I will discuss other vendors in future articles. To date no other vendors have provided in-line Data Profiling in the form of a Data Flow transformation, as opposed to a Control Flow task.

Now let’s examine implementing the same solution, with custom ABI Data Flow components.

image

 

We will use three Data Flow custom components as follows:

ABI Column Profiler –

Calculates profiling statistics for a specific column in the pipeline: Min, Max, Null count, etc.

ABI Column Pattern –

Determines the pattern for a specific column in the pipeline, such as 999-999-9999 for a US phone number or 999-99-9999 for an SSN, and makes it available to downstream transformations.

ABI Column Domain Pattern –

Determines the distinct values for a specific column and the counts for each value, as well as the associated pattern. For example, for SSN:

 

Column Name    Pattern        Value
SSN            999-99-9999    123-45-6789
SSN            999-99-9999    123-45-6790
SSN            999-99-999z    123-45-890i

 

In this article we will examine the Column Profiler in detail. Additional ABI components, as well as those from other vendors, will be covered in future articles.

 

SSIS ABI Column Profiler

image

Once you  have installed the ABI SSIS Column Data Profile you can select it from the Toolbox.

SSIS ABI Column Profiler – Column Selection

image

Once you have connected a Source to the component, a very simple interface is provided to select the columns to profile or pass through.

Available Output Columns are as follows:

 image

The output columns can be routed to any other Transform or Data Flow Destination.

 

image

In addition a table, file or xml output can be created to support or maintain a custom metadata repository or create a structure to support a data driven application.

DDL Created from SSIS Column Profiler: 

CREATE TABLE [ColumnProfileStats] (
[SortKeyPosition] SMALLINT,
[ComparisonFlag] SMALLINT,
[Precision] SMALLINT,
[Length] SMALLINT,
[CodePage] SMALLINT,
[InferredDataType] VARCHAR(15),
[SQLDataType] VARCHAR(20),
[BlankDomainRatio] REAL,
[BlankDomainFrequency] REAL,
[NullDomainRatio] REAL,
[NullDomainFrequency] REAL,
[MaxDomainValue] VARCHAR(2000),
[MinDomainValue] VARCHAR(100),
[MostPopularDomainRatio] REAL,
[MostPopularDomainFrequency] REAL,
[MostPopularDomainValue] VARCHAR(2000),
[DistinctDomainRatio] REAL,
[DistinctDomainFrequency] REAL,
[TotalRecordCount] REAL,
[ColumnName] VARCHAR(255)
)
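Once the profiling output lands in this table, it can drive data-driven checks downstream. A minimal sketch, assuming the ratio columns hold fractions between 0 and 1 (the exact semantics depend on the component), might flag columns with a high null ratio:

-- Illustrative only: use the profiling output to flag columns needing attention.
select ColumnName, NullDomainRatio, DistinctDomainFrequency, TotalRecordCount
from ColumnProfileStats
where NullDomainRatio > 0.10
order by NullDomainRatio desc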

The source code for the Script Component solution can be obtained from Data Profiling in SSIS on JumpStartTV. The custom column profiling component can be downloaded from Actuality Business Intelligence.

 

 

image

 

image

Ira W. Whiteside

Actuality Business Intelligence LLC