Roll your own Data Profiling – Column Functional Dependency

By Ira Warren Whiteside

 

clip_image001[1]

Actuality Business Intelligence

 

 

A while back I was reading one of Jamie Thomson most excellent posts on Conchango blogs SSIS Junkie SSIS: Data Profiling Task: Part 7 – Functional Dependency and it occurred to me it would be interesting to provide a poor man’s Functional Dependency using on TSQL.

 

First let’s define Functional Dependency.

Functional Dependency

Determines the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column) ; This profile also helps you identify problems in your data, such as values that are not valid. For example, you profile the dependency between a column that contains Country or Region Codes and a column that contains States/Provinces. The same Country or Region Codes should always have the same States/Provinces, but the profile discovers violations of this dependency.

 

Now let’s take a crack at writing our own. Naturally when facing a difficult problem in TSQL I turned to SQL Server Central vast library.

 

When analyzing the capabilities needed to determine one columns dependency on another you would obviously need recursion or the ability to link one row to the next. A brief search of existing articles SQL Server Central returned Linking to the Previous Row By David McKinney, 2008/03/13 and Common Table Expressions in SQL Server 2005 By Srinivas Sampath, 2007/03/09

 

Armed with these two articles I now had all the techniques I need to develop a query or stored procedure to provide Column Functional Dependency.

 

TSQL Functional Dependency

The heart of this effort is to utilize Common Table Expressions(CTE) or the “with clause as defined on MSDN as:

 

“Specifies a temporary named result set, known as a common table expression (CTE). This is derived from a simple query and defined within the execution scope of a single SELECT, INSERT, UPDATE, or DELETE statement. This clause can also be used in a CREATE VIEW statement as part of its defining SELECT statement. A common table expression can include references to itself. This is referred to as a recursive common table expression.”

 

Basically the “with” clause provides you with the ability to logically create CTE’s or result sets ,logically develop your code and employ recursive techniques.

 

A detailed treatment of CTE’s is available at and Common Table Expressions in SQL Server 2005 By Srinivas Sampath, 2007/03/09

 

Setup Test Data

I have used the same test setup that Jamie Thomson provides for testing the SQL Server 2008 Data Profiling Task.

 

 

select ROW_NUMBER() OVER (partition by sp.Name order by a.PostalCode) RowNum

, a.PostalCode

, sp.Name AS StateProvinceName

, sp.CountryRegionCode

into Addresses

from Person.[Address] a

inner join Person.StateProvince sp

on a.StateProvinceID = sp.StateProvinceID

update Addresses

set CountryRegionCode = ‘ZZ’

where RowNum = 1

 

 

 

This will create a test table called Addresses, which contains a Functional Dependency scenario for Province or States and Countries.

 

 

clip_image003

 

 

 

 

 

Example Functional Dependency Stored Procedure

 

 

USE [AdventureWorks]

GO

/****** Object: Table [dbo].[DependencyColumnTable] Script Date: 01/14/2009 19:07:54 ******/

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[DependencyColumnTable]’) AND type in (N’U’))

DROP TABLE [dbo].[DependencyColumnTable]

GO

/****** Object: StoredProcedure [dbo].[uspGetBillOfMaterials] Script Date: 01/14/2009 14:26:33 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

USE [AdventureWorks]

GO

/****** Object: StoredProcedure [dbo].[ColumnDependency] Script Date: 01/14/2009 17:33:24 ******/

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependency]’) AND type in (N’P’, N’PC’))

DROP PROCEDURE [dbo].[ColumnDependency]

GO

USE [AdventureWorks]

GO

/****** Object: StoredProcedure [dbo].[ColumnDependency] Script Date: 01/14/2009 17:33:24 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

CREATE PROCEDURE [dbo].[ColumnDependency]

@TableName VARCHAR(128) ,

@Determinant VARCHAR(128) ,

@Dependent VARCHAR(128)

AS

Declare @SQL VarChar(MAX)

SELECT @SQL = ‘ ;WITH DependencyCompare AS (‘

SELECT @SQL = @SQL + ‘ SELECT ‘

SELECT @SQL = @SQL + @Determinant + ‘ ,’ + @Dependent + ‘, ‘

SELECT @SQL = @SQL + ‘ROW_NUMBER() OVER (Partition BY ‘ + @Dependent + ‘ ORDER BY ‘ + @Determinant + ‘ ) AS rownumcalc FROM ‘

SELECT @SQL = @SQL + @TableName

SELECT @SQL = @SQL + ‘ ) SELECT

currow.’ + @Dependent + ‘ AS curDet,

prevrow.’ + @Dependent + ‘ AS prevDet,

nextrow.’ + @Dependent + ‘ AS nextDet,

currow.’ + @Determinant + ‘ AS curDep,

prevrow.’ + @Determinant + ‘ AS prevDep,

nextrow.’ + @Determinant

+ ‘ AS nextDep,

currow.rownumcalc ,

ViolationCount =

Case

When currow.’ + @Determinant + ‘ = prevrow.’ + @Determinant +

‘ and currow.’ + @Dependent + ‘ = prevrow.’ + @Dependent +

‘ then 0

When prevrow.’ + @Dependent + ‘ is null

then 0

Else 1

End’

+

‘ INTO DependencyColumnTable

FROM

DependencyCompare currow ‘

+ ‘ LEFT JOIN DependencyCompare nextrow ON currow.rownumcalc = nextrow.rownumcalc – 1 AND currow.’

+ @Dependent

+ ‘ = nextrow.’

+ @Dependent

+ ‘ LEFT JOIN DependencyCompare prevrow ON currow.rownumcalc = prevrow.rownumcalc + 1 AND currow.’

+ @Dependent

+ ‘ = prevrow.’

+ @Dependent

+ ‘ Order by currow.’

+ @Dependent

Print @SQL

Exec (@SQL)

GO

Exec ColumnDependency

@TableName = ‘Addresses’,

@Dependent = ‘StateProvinceName’,

@Determinant = ‘CountryRegionCode ‘

 

 

The code here is basically leveraging the recursive capabilities of CTE. Our objective here is to compare current rows to previous row columns to determine when a Dependent Column value changes and the Determinant columns has NOT changes. This is a Violation. A detailed explanation of the technique of linking to previous records and be reviewed at SQL Server Central returned Linking to the Previous Row By David McKinney, 2008/03/13. One issue I had was needing to dynamically populate the From clause or Table.

 

I used the technique presented by Introduction to Dynamic SQL (Part 1) By Damian Maclennen on 20 June 2001. This technique relies in creating a @SQL variable and executing the variable as part of the stored procedure.

 

 

 

 

Create Procedure GenericTableSelect

@TableName VarChar(100)

AS

Declare @SQL VarChar(1000)

SELECT @SQL = ‘SELECT * FROM ‘

SELECT @SQL = @SQL + @TableName

Exec ( @SQL)

GO

`

 

 

Example Functional Dependency Results

 

 

clip_image005

 

 

 

As you can see here Alabama is our Determinant column which normally has an associated Dependent Values of US. You can see when the Dependent values changes to ZZ it is tagged a Violation Count as ‘1’

 

 

Example Functional Summary

 

 

/* Sample Execution

exec ColumnDependencyOLAP

*/

/* This SP will create a Summary Table and calculate the Support and Vio;ation percentages.

exec ColumnDependencyOLAP

*/

USE [AdventureWorks]

GO

/****** Object: StoredProcedure [dbo].[uspGetBillOfMaterials] Script Date: 01/14/2009 14:26:33 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

USE [AdventureWorks]

GO

/****** Object: StoredProcedure [dbo].[ColumnDependencyOLAP] Script Date: 01/14/2009 17:33:24 ******/

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependencyOLAP]’) AND type in (N’P’, N’PC’))

DROP PROCEDURE [dbo].[ColumnDependencyOLAP]

GO

/****** Object: StoredProcedure [dbo].[ColumnDependencyOLAP] Script Date: 01/14/2009 17:33:24 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

CREATE PROCEDURE [dbo].[ColumnDependencyOLAP]

AS

Declare @SQL VarChar(MAX)

SELECT @SQL = ‘

;WITH DependencyCompareDetail AS

(

Select count(ViolationCount) as Violations , curDet , curDep,

Case when ViolationCount = 0

then ”Support Percentage”

else ”Violation Percentage”

end as Violation ,

ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet ,curDep) AS rownumcalc

FROM [AdventureWorks].[dbo].[DependencyColumnTable]

–Where ViolationCountNull = ”N”

group by curDet, curDep , ViolationCount

)

,DependencyCompareSummary AS (

Select count(ViolationCount) as total , curDet ,

ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet ) AS rownumcalc

FROM [AdventureWorks].[dbo].[DependencyColumnTable]

–Where ViolationCountNull = ”N”

group by curDet

)

,DependencyCompareOLAP AS

(

Select

DependencyCompareDetail.curDet ,

DependencyCompareDetail.curDep,

DependencyCompareDetail.Violation,

DependencyCompareDetail.Violations ,

DependencyCompareSummary.total ,

ROW_NUMBER() OVER (Partition BY DependencyCompareDetail.curDet ORDER BY DependencyCompareDetail.curDep) AS rownumcalc

From DependencyCompareDetail

LEFT JOIN DependencyCompareSummary

ON DependencyCompareDetail.curDet =DependencyCompareSummary.curDet

)

Select

curDet as Determinant,

curDep as Dependent,

Violation as ”Dependency Type”,

Violations as ”Dependeny Counts” ,

total as ”Total Determinant Count”,

cast(cast(Violations AS Decimal ) /cast(total AS Decimal )* 100 as decimal (5, 2)) as PercentViolation

INTO DependencyCompareOLAP

from DependencyCompareOLAP

Order by curDet , curDep ‘

Print @SQL

Exec (@SQL)

GO

 

 

Now that we have created a table with the Functional Dependency Violations identified we want to create a table that can be used for Analysis or Reporting, containing the same percentages available in the SSIS 2008n Data Profiling Task Functional Option. Again we have relied on the CTE in order to create a reusable query that can create a single table that has the required for analysis such as Totlal Determinint , Dependents categroized by Supporting Dependents and Violating Dependents.

 

 

Example Functional Summary Results

 

 

 

clip_image007

 

Here are the results from the SSIS 2008 Data Profiling Task for Functional Dependency.

 

clip_image009

 

As you can see the Support Percentages for Alabama are exactly the same.

 

Roll your Own Interactive Scorecard

 

clip_image011

 

I would also add from an analytical perspective you can quickly create a Pivot Table in Microsoft Excel 2007. Simply open Excel 2007, select the Data Tab .

Then select From Other Sources then follow the wizard and create a connection to the DependencyCompareOLAP table to create the Pivot table and Chart below. As you can see here is a quick way to determine the outliers like Alabama.

 

clip_image013

 

Or Manitoba in Canada

 

clip_image015

 

Further Enhancements

 

You also can add additional columns to the table and stored procedures, such as table name, date and time and create a repository that would allow you to track the Violations incrementally or over time, and provide proactive processes to respond to continuing anomalies.

 

Here are several additional links that were helpful.

 

2. Introduction to Dynamic SQL (Part 1) By Damian Maclennen on 20 June 2001

3. SQL Server 2005 Recursion and WITH Clause By Robbe D. Morris

4. The SQL WITH Clause, (Order Your Thoughts, Reuse Your Code) Submitted by Kevin Meade on Tue, 2007-06-19 14:59

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s