Implementing a Metadata Mart the Road to Data Governance best viewed in PRESENTATION Mode, there is animation.
This presentation has narrative, play in presentation mode with sound on.
Implementing a Metadata Mart the Road to Data Governance best viewed in PRESENTATION Mode, there is animation.
This presentation has narrative, play in presentation mode with sound on.
With Data Profiling can apply the age old management adage “You get what you inspect, not what you expect” Readers Digest.
This article will describe how to implement a data profiling dashboard in Excel and a metadata repository as well as the TSQL required to load and update the repository.
Future articles will explore the data model as well as column and table relationship analysis using the Domain profiling results.
Data Profiling is essential to properly determine inconsistencies as well data transformation requirements for integration efforts.
It is also important to be able to communicate the general data quality for the datasets or tables you will be processing.
With the assistance over the years of a few friends(Joe Novella, Scott Morgan and Michael Capes) as well as the work of Stephan DeBlois, I have created a set of TSQL Scripts that will create a set of tables that will provide the statistics to present your clients with a Data Quality Scorecard comparable to existing vendor tools such as Informatica , Data Flux , Data Stage and the SSIS Data Profiling Task.
This article contains a Complete Data Profiling Kit – Download (Code, Excel Dashboards and Samples) providing capabilities similar to leading vendor tools, such as Informatica, Data Stage, Dataflux etc…
The primary difference is the repository is open and the code is also open and available and customizable. The profiling process has 4 steps as follows:
Here is a sample of the Column Statistics Dashboard: There are three panels show one worksheet for a sample “Customers” file.
Complete Column Dashboard:
In the example below you see an Excel Worksheet that contains a pivot table allowing you to examine a columns patterns , in this cse Zip code, and subsequently drill into the actual values related to one of the patterns. Notice the Zip code example, we will review the pattern “9999”, or Zip code with only 4 numeric digits. When you click on the pattern og “9999” the actual value is revealed is
This paper will outline the advantages of developing your complete Business Analytics solution within the Microsoft suite of Business Intelligence capabilities, including Microsoft SQL Server(SSIS, SSAS, SSRS and Microsoft Office).
This approach will yield and very powerful and detailed Analytical application focused on your business and it particular needs both data sources and business rules. In addition to a reporting and analytical solution, you will have insight and access to the data and business rule “Lineage” or DNA allowing you click on any anomalies in any dash board and not only drill into the detail, but also see how data was transformed and/or changes as it was process to create your dashboard, excel workbook or report.
Lineage as it applies to data in Business Intelligence provides the capability see track back where data came from. Essentially it supports “Metric Decomposition” which is a process for breaking down a metric (business calculation) into separate parts. With this capability you can determine your data’s ancestry or Data DNA.
An “Absolute Truth” in today’s environment (Business Discovery, Big Data, Self Service Whatever?) is that you never have enough detail and usually have to stop clicking before you have an answer to your question and look at another report of application. Just when you about to get to the “bottom of it” you run out of clicks As Donald Farmer (VP Product Management) pointed out recently we are hunters and that’s why we like search engines. We enter phrases for things were hunting, and the search engine presents large list which we then follow the “tracks” of what we ae looking for , if we go down a wrong path we back track and continue, continually getting closer to our answer. Search engines implement lineage thru “key words “ and there relationship to web pages, which intern link to other pages. This capability is now “heuristic” or common sense.
While the method of analyzing data is “Rule of Thumb” for internet searches, it is not in terms of analyzing your business reports.
If you were able to “link” all your data from it the original source to the final destination, then would you have the ability to “hunt” through your data the way you hunt with Google, Bing etc…
Let apply this concept to customer records with phone numbers (phone1 column). First we clean the phone number (phone1Cleansed column). We keep both columns in our processing results. Then in addition we keep the name of the “Business Rule we used to clean it (phone1_Category), an indicator to identify valid or invalid phone numbers(phone1_Valid) and the actual column name used as a source(phone1_ColumnName). The “phone1_Rule” column is used to tell us which rule was used, the first row used only the Phone Parse rule to format the number, the second row used two rules, NumbersOnly tells us thet the number contains non numeric character and the Phone Parse rule formats the number with periods separating the area code, prefix and suffix.
The end result is when this is included in your reporting or analytical solution you know:
In addition if you then choose to implement data mining and/or “Predictive Analytics” already a part of the Microsoft BI Suite, you will be facing the typical pitfalls associated with trying to predict based on either “dirty data” or highly “scrubbed data”, lacking lineage.
I have assembled a collection of vendors product that when used in conjunction with the Microsoft BI suite can provide this capability at reasonable in line with your investment with Microsoft SQL Server.
If done manually this would require extensive additional coding, however with the tools we have selected, this can be accomplished automatically as well as automating the loading of an Analysis Services cube for Analytics.
Our solution brings together the Microsoft offerings of Actuality Business Intelligence, Melissa Data and HaloBI. This solution can be implemented by any moderately skilled Microsoft developer familiar with the BI Suite(SSIS and SSAS) and does not require any expensive niche ETL or Analytical software.
In the next post I will walk through a real world implementation.
Microsoft SSIS Package
Data Lineage SSIS Example using ABI Profiling, Melissa Data , SSIS Fuzzy Groupng, SSAS cube genertion
New post Kimball ETL (subsystem 1) Data Profiling via TSQL http://bit.ly/LIJksn
#dataquality #dataprofiling @sqlservercentrl