I was heavily involved in business intelligence, data warehousing and data governance several years ago, and after a stretch of chaotic personal challenges I have recently returned to professional practice. I have discovered that things have not changed that much in 10 years. The methodologies and approaches are still relatively consistent; however, the tools and techniques have changed and, in my opinion, not for the better. Without focusing on specific tools, I've observed that the core of data governance or MDM is enabling and providing a capability for classifying data into business categories or nomenclature, and that capability has really not improved.

- This basic traditional approach has not changed: in essence, an AI model predicts a Metric and is wholly dependent on the integrity of its features or Dimensions.
Therefore I decided to update some of the techniques and code patterns I've used in the past regarding the information value chain and/or record linkage, and we are going to make the results available with associated business and code examples, initially with SQL Server and Databricks plus Python.
My good friend Jordan Martz, of DataMartz fame, has greatly contributed to this old man's Big Data enlightenment, as has Craig Campbell, in updating some of the basic classification capabilities required and critical for data governance. If you would like a more detailed version of the source as well as the test data, please send me an email at iwhiteside@msn.com. Stay tuned for more updates; soon we will add Neural Network capability for additional automation of "Governance Type" classification and confidence monitoring.
Before we focus on functionality, let's focus on methodology.
Initially, understand the key metrics/KPIs to be measured, their formulas and, of course, the business's expectations for how they are calculated.
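A minimal way to capture those definitions before any modelling starts is a simple structure recording the metric, its formula and the business's stated expectation. The KPI names and formulas below are purely illustrative and are not taken from any client work:

kpi_definitions = [
    {"kpi": "Gross Margin %",
     "formula": "(Revenue - COGS) / Revenue",
     "business_expectation": "Calculated monthly at product-line grain; excludes intercompany sales"},
    {"kpi": "On-Time Delivery Rate",
     "formula": "OnTimeShipments / TotalShipments",
     "business_expectation": "Measured against the customer-requested date, not the promised date"},
]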
Immediately gather file sources and complete profiling as specified in my original article found here
Implementing the processes in my meta-data mart article would provide numerous statistics for integer or float fields; however, there are some special considerations for text fields or smart codes.
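One hedged illustration of such a consideration: profile a text or smart-code field by reducing each value to a generalized pattern (letters to A, digits to 9) and counting pattern frequencies. The column name and sample values below are made up for the example and are not from the meta-data mart article itself:

import pandas as pd

df = pd.DataFrame({"ProductCode": ["AB-1234", "AB-9921", "ZZ-0001", "9X13", None]})

def code_pattern(value):
    # Reduce a value to its shape: A for letters, 9 for digits, other characters kept as-is
    if value is None:
        return "NULL"
    pattern = ""
    for ch in str(value):
        if ch.isalpha():
            pattern += "A"
        elif ch.isdigit():
            pattern += "9"
        else:
            pattern += ch
    return pattern

profile = df["ProductCode"].map(code_pattern).value_counts()
print(profile)  # AA-9999 occurs three times, 9A99 once, NULL once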
Before beginning classification, you would employ similarity matching or fuzzy matching, as described here.
As I said, I posted the code for this process on SQL Server Central 10 years ago; here is a Python version.
Roll Your Own – Python Jaro_Winkler (Python notebook)
Step 1a – import pandas
import pandas
Step 2 – Import Libraries
from pyspark.sql.functions import input_file_name
from pyspark.sql.types import *
import datetime, time, re, os, pandas
# ML Libraries
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, NGram, HashingTF, IDF, Word2Vec, Normalizer, Imputer, VectorAssembler
from pyspark.ml import Pipeline
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.cluster import KMeans
import numpy as np
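Several of the libraries imported above (the feature transformers, Pipeline and KMeans) are not exercised in the Jaro-Winkler snippet itself; they are the building blocks for the automated "Governance Type" classification mentioned earlier. Here is a minimal sketch of how they could be wired together, assuming a Spark DataFrame names_df with a text column NameInput (both names are assumptions for the example, not part of the notebook):

# Tokenize the text, hash it into TF-IDF features, then cluster similar names
tokenizer = RegexTokenizer(inputCol="NameInput", outputCol="tokens", pattern="\\W")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
tf = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1024)
idf = IDF(inputCol="tf", outputCol="features")
pipeline = Pipeline(stages=[tokenizer, remover, tf, idf])

features_df = pipeline.fit(names_df).transform(names_df)

# scikit-learn's KMeans works on a local matrix, so collect only a small sample
sample = np.array([row["features"].toArray() for row in features_df.select("features").collect()])
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(sample)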
Step 3 – Test JaroWinkler
JaroWinkler('TRAC', 'TRACE')
Out[5]: 0.933333
Step 4a – Implement JaroWinkler (Fuzzy Matching)
%python
def JaroWinkler(str1_in, str2_in):
    # Treat a missing value on either side as no similarity
    if (str1_in is None or str2_in is None):
        return 0.0
    tr = 0
    common = 0
    jaro_value = 0.0
    len_str1 = len(str1_in)
    len_str2 = len(str2_in)
    # Working tables flag which character positions have been matched
    column_names = ['FID', 'FStatus']
    df_temp_table1 = pandas.DataFrame(columns=column_names)
    df_temp_table2 = pandas.DataFrame(columns=column_names)
    #clean_string(str1_in)
    #clean_string(str2_in)
    # Make str1_in the shorter of the two strings
    if len_str1 > len_str2:
        swap_len = len_str2
        len_str2 = len_str1
        len_str1 = swap_len
        swap_str = str1_in
        str1_in = str2_in
        str2_in = swap_str
    max_len = len_str2
    # Initialise one row per character position, all unmatched
    iCounter = 1
    while (iCounter <= len_str1):
        df = pandas.DataFrame([[iCounter, 0]], columns=column_names)
        df_temp_table1 = pandas.concat([df_temp_table1, df], ignore_index=True)
        iCounter = iCounter + 1
    iCounter = 1
    while (iCounter <= len_str2):
        df = pandas.DataFrame([[iCounter, 0]], columns=column_names)
        df_temp_table2 = pandas.concat([df_temp_table2, df], ignore_index=True)
        iCounter = iCounter + 1
    iCounter = 1
    # Characters count as common only when they match within this window
    m = int(round((max_len / 2) - 1, 0))
    i = 1
    while (i <= len_str1):
        a1 = str1_in[i - 1]
        if m >= i:
            f = 1
            z = i + m
        else:
            f = i - m
            z = i + m
        if z > max_len:
            z = max_len
        # Scan the window in str2 for an unmatched occurrence of the character
        while (f <= z):
            a2 = str2_in[int(f - 1)]
            if (a2 == a1 and df_temp_table2.loc[f - 1].at['FStatus'] == 0):
                common = common + 1
                df_temp_table1.at[i - 1, 'FStatus'] = 1
                df_temp_table2.at[f - 1, 'FStatus'] = 1
                break
            f = f + 1
        i = i + 1
    # Count transpositions by walking the matched characters of both strings in order
    i = 1
    z = 1
    while (i <= len_str1):
        v1Status = df_temp_table1.loc[i - 1].at['FStatus']
        if (v1Status == 1):
            while (z <= len_str2):
                v2Status = df_temp_table2.loc[z - 1].at['FStatus']
                if (v2Status == 1):
                    a1 = str1_in[i - 1]
                    a2 = str2_in[z - 1]
                    z = z + 1
                    if (a1 != a2):
                        tr = tr + 0.5
                    break
                z = z + 1
        i = i + 1
    # Equal weights for the three Jaro components
    wcd = 1.0 / 3.0
    wrd = 1.0 / 3.0
    wtr = 1.0 / 3.0
    if (common != 0):
        jaro_value = (wcd * common) / len_str1 + (wrd * common) / len_str2 + (wtr * (common - tr)) / common
    return round(jaro_value, 6)
Step 4b – Register JaroWinkler
spark.udf.register("JaroWinkler", JaroWinkler, DoubleType())
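A quick, hedged sanity check of the registered function. The sample strings and the pairs_df DataFrame (assumed to hold NameLookup and NameInput columns) are illustrative assumptions, not part of the article's test data:

spark.sql("SELECT JaroWinkler('MARTHA', 'MARHTA') AS MatchScore").show()

from pyspark.sql.functions import expr
scored_df = pairs_df.withColumn("MatchScore", expr("JaroWinkler(NameLookup, NameInput)"))
scored_df.show()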
Step 8a – Bridge vs Master vs Associative
%sql
DROP TABLE IF EXISTS NameAssociative;
CREATE TABLE NameAssociative AS
SELECT
 a.NameLookup
 ,b.NameInput
 ,sha2(regexp_replace(a.NameLookup, '[^a-zA-Z0-9, ]', ' '), 256) as NameLookupCleaned
 ,a.NameLookupKey
 ,sha2(regexp_replace(b.NameInput, '[^a-zA-Z0-9, ]', ' '), 256) as NameInputCleaned
 ,b.NameInputKey
 ,JaroWinkler(a.NameLookup, b.NameInput) as MatchScore
 ,RANK() OVER (PARTITION BY a.DetailedBUMaster ORDER BY JaroWinkler(a.NameLookupCleaned, b.NameInputCleaned) DESC) as NameRank
FROM NameLookup as a
CROSS JOIN NameInput as b;
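Once the cross join is scored and ranked, the associative (bridge) rows are typically limited to the best match per lookup value. A small follow-up sketch, assuming the NameAssociative table and the NameRank alias created above:

best_matches = spark.sql("""
    SELECT NameLookup, NameInput, MatchScore, NameLookupKey, NameInputKey
    FROM NameAssociative
    WHERE NameRank = 1
""")
best_matches.show()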