I was heavily involved in business intelligence, data warehousing and data governance several years ago, and after a stretch of chaotic personal challenges I have recently returned to professional practice. I have discovered that things have not changed that much in 10 years. The methodologies and approaches are still relatively consistent; however, the tools and techniques have changed and, in my opinion, not for the better. Without focusing on specific tools, I've observed that the core of data governance or MDM is enabling and providing a capability for classifying data into business categories or nomenclature, and that capability has really not improved.

- This basic traditional approach has not changed: in essence, an AI model predicts a Metric and is wholly dependent on the integrity of its features or Dimensions.
Therefore I decided to update some of the techniques and code patterns I've used in the past regarding the information value chain and/or record linkage, and we are going to make the results available with associated business and code examples, initially with SQL Server and Databricks plus Python.
My good friend Jordan Martz, of DataMartz fame, has greatly contributed to this old man's Big Data enlightenment, as has Craig Campbell, in updating some of the basic classification capabilities required and critical for data governance. If you would like a more detailed version of the source as well as the test data, please send me an email at iwhiteside@msn.com. Stay tuned for more updates; soon we will add Neural Network capability for additional automation of "Governance Type" classification and confidence monitoring.
Before we focus on functionality, let's focus on methodology.
Initially, understand the key metrics/KPIs to be measured, their formulas and, of course, the business's expectations for how they are calculated.
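A minimal way to capture those definitions before any modelling starts is a simple structure recording the metric, its formula and the business's stated expectation. The KPI names and formulas below are purely illustrative and are not taken from any client work:

kpi_definitions = [
    {"kpi": "Gross Margin %",
     "formula": "(Revenue - COGS) / Revenue",
     "business_expectation": "Calculated monthly at product-line grain; excludes intercompany sales"},
    {"kpi": "On-Time Delivery Rate",
     "formula": "OnTimeShipments / TotalShipments",
     "business_expectation": "Measured against the customer-requested date, not the promised date"},
]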
Immediately gather file sources and complete profiling as specified in my original article found here
Implementing the processes in my meta-data mart article would provide numerous statistics for integer or float fields; however, there are some special considerations for text fields or smart codes.
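One hedged illustration of such a consideration: profile a text or smart-code field by reducing each value to a generalized pattern (letters to A, digits to 9) and counting pattern frequencies. The column name and sample values below are made up for the example and are not from the meta-data mart article itself:

import pandas as pd

df = pd.DataFrame({"ProductCode": ["AB-1234", "AB-9921", "ZZ-0001", "9X13", None]})

def code_pattern(value):
    # Reduce a value to its shape: A for letters, 9 for digits, other characters kept as-is
    if value is None:
        return "NULL"
    pattern = ""
    for ch in str(value):
        if ch.isalpha():
            pattern += "A"
        elif ch.isdigit():
            pattern += "9"
        else:
            pattern += ch
    return pattern

profile = df["ProductCode"].map(code_pattern).value_counts()
print(profile)  # AA-9999 occurs three times, 9A99 once, NULL once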
Before beginning classification, you would employ similarity matching or fuzzy matching, as described here.
As I said, I posted the code for this process on SQL Server Central 10 years ago; here is a Python version.
Roll Your Own – Python Jaro_Winkler (Python notebook)
Step 1a – import pandas
import pandas
Step 2 – Import Libraries
from pyspark.sql.functions import input_file_name
from pyspark.sql.types import *
import datetime, time, re, os, pandas
# ML Libraries
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, NGram, HashingTF, IDF, Word2Vec, Normalizer, Imputer, VectorAssembler
from pyspark.ml import Pipeline
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.cluster import KMeans
import numpy as np
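Several of the libraries imported above (the feature transformers, Pipeline and KMeans) are not exercised in the Jaro-Winkler snippet itself; they are the building blocks for the automated "Governance Type" classification mentioned earlier. Here is a minimal sketch of how they could be wired together, assuming a Spark DataFrame names_df with a text column NameInput (both names are assumptions for the example, not part of the notebook):

# Tokenize the text, hash it into TF-IDF features, then cluster similar names
tokenizer = RegexTokenizer(inputCol="NameInput", outputCol="tokens", pattern="\\W")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
tf = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1024)
idf = IDF(inputCol="tf", outputCol="features")
pipeline = Pipeline(stages=[tokenizer, remover, tf, idf])

features_df = pipeline.fit(names_df).transform(names_df)

# scikit-learn's KMeans works on a local matrix, so collect only a small sample
sample = np.array([row["features"].toArray() for row in features_df.select("features").collect()])
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(sample)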
Step 3 – Test JaroWinkler
JaroWinkler('TRAC', 'TRACE')
Out[5]: 0.933333
Step 4a – Implement JaroWinkler (Fuzzy Matching)
%python
def JaroWinkler(str1_in, str2_in):
    # Treat a missing value on either side as no similarity
    if (str1_in is None or str2_in is None):
        return 0.0
    tr = 0
    common = 0
    jaro_value = 0.0
    len_str1 = len(str1_in)
    len_str2 = len(str2_in)
    # Working tables flag which character positions have been matched
    column_names = ['FID', 'FStatus']
    df_temp_table1 = pandas.DataFrame(columns=column_names)
    df_temp_table2 = pandas.DataFrame(columns=column_names)
    #clean_string(str1_in)
    #clean_string(str2_in)
    # Make str1_in the shorter of the two strings
    if len_str1 > len_str2:
        swap_len = len_str2
        len_str2 = len_str1
        len_str1 = swap_len
        swap_str = str1_in
        str1_in = str2_in
        str2_in = swap_str
    max_len = len_str2
    # Initialise one row per character position, all unmatched
    iCounter = 1
    while (iCounter <= len_str1):
        df = pandas.DataFrame([[iCounter, 0]], columns=column_names)
        df_temp_table1 = pandas.concat([df_temp_table1, df], ignore_index=True)
        iCounter = iCounter + 1
    iCounter = 1
    while (iCounter <= len_str2):
        df = pandas.DataFrame([[iCounter, 0]], columns=column_names)
        df_temp_table2 = pandas.concat([df_temp_table2, df], ignore_index=True)
        iCounter = iCounter + 1
    iCounter = 1
    # Characters count as common only when they match within this window
    m = int(round((max_len / 2) - 1, 0))
    i = 1
    while (i <= len_str1):
        a1 = str1_in[i - 1]
        if m >= i:
            f = 1
            z = i + m
        else:
            f = i - m
            z = i + m
        if z > max_len:
            z = max_len
        # Scan the window in str2 for an unmatched occurrence of the character
        while (f <= z):
            a2 = str2_in[int(f - 1)]
            if (a2 == a1 and df_temp_table2.loc[f - 1].at['FStatus'] == 0):
                common = common + 1
                df_temp_table1.at[i - 1, 'FStatus'] = 1
                df_temp_table2.at[f - 1, 'FStatus'] = 1
                break
            f = f + 1
        i = i + 1
    # Count transpositions by walking the matched characters of both strings in order
    i = 1
    z = 1
    while (i <= len_str1):
        v1Status = df_temp_table1.loc[i - 1].at['FStatus']
        if (v1Status == 1):
            while (z <= len_str2):
                v2Status = df_temp_table2.loc[z - 1].at['FStatus']
                if (v2Status == 1):
                    a1 = str1_in[i - 1]
                    a2 = str2_in[z - 1]
                    z = z + 1
                    if (a1 != a2):
                        tr = tr + 0.5
                    break
                z = z + 1
        i = i + 1
    # Equal weights for the three Jaro components
    wcd = 1.0 / 3.0
    wrd = 1.0 / 3.0
    wtr = 1.0 / 3.0
    if (common != 0):
        jaro_value = (wcd * common) / len_str1 + (wrd * common) / len_str2 + (wtr * (common - tr)) / common
    return round(jaro_value, 6)
Step 4b – Register JaroWinkler
spark.udf.register("JaroWinkler", JaroWinkler, DoubleType())
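A quick, hedged sanity check of the registered function. The sample strings and the pairs_df DataFrame (assumed to hold NameLookup and NameInput columns) are illustrative assumptions, not part of the article's test data:

spark.sql("SELECT JaroWinkler('MARTHA', 'MARHTA') AS MatchScore").show()

from pyspark.sql.functions import expr
scored_df = pairs_df.withColumn("MatchScore", expr("JaroWinkler(NameLookup, NameInput)"))
scored_df.show()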
Step 8a – Bridge vs Master vs Associative
%sql
DROP TABLE IF EXISTS NameAssociative;
CREATE TABLE NameAssociative AS
SELECT
 a.NameLookup
 ,b.NameInput
 ,sha2(regexp_replace(a.NameLookup, '[^a-zA-Z0-9, ]', ' '), 256) as NameLookupCleaned
 ,a.NameLookupKey
 ,sha2(regexp_replace(b.NameInput, '[^a-zA-Z0-9, ]', ' '), 256) as NameInputCleaned
 ,b.NameInputKey
 ,JaroWinkler(a.NameLookup, b.NameInput) as MatchScore
 ,RANK() OVER (PARTITION BY a.DetailedBUMaster ORDER BY JaroWinkler(a.NameLookupCleaned, b.NameInputCleaned) DESC) as NameRank
FROM NameLookup as a
CROSS JOIN NameInput as b;
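Once the cross join is scored and ranked, the associative (bridge) rows are typically limited to the best match per lookup value. A small follow-up sketch, assuming the NameAssociative table and the NameRank alias created above:

best_matches = spark.sql("""
    SELECT NameLookup, NameInput, MatchScore, NameLookupKey, NameInputKey
    FROM NameAssociative
    WHERE NameRank = 1
""")
best_matches.show()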