Advanced Fuzzy Matching via Record Linkage Methodology

This article addresses the steps needed for efficient data and record processing when the workflow includes a record linkage or fuzzy matching step. First we will define the matching process, review well-researched solutions, and explain the inherent performance problem.

  

Any record linkage operation ultimately requires string matching: some columns in one complete set of records must be compared against all the records in another set, effectively a cross join. This is a very resource-intensive process, so the trick is to reduce the number of string comparisons as much as possible. This problem has been researched by many academics.
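To get a sense of the scale (illustrative numbers only): matching two files of 10,000 records each as a naive cross join produces 10,000 × 10,000 = 100,000,000 candidate pairs. If the same records are first grouped into, say, 50 equally sized state blocks of roughly 200 records each, only pairs within a block are compared: 50 × 200 × 200 = 2,000,000 pairs, a fifty-fold reduction before a single string comparison is made.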
Over at Henrik Liliendahl Sorensen’s LinkedIn Group for Data Matching, Bill Winkler, principal researcher at the US Census Bureau, has shared several reference papers on the reasoning and methodology for record linkage using blocking. They are excellent and I wanted to share them with you:
Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. (2003), “Robust and Efficient Fuzzy Match for Online Data Cleaning,” ACM SIGMOD ’03, 313-324.

Baxter, R., Christen, P. and Churches, T. (2003), “A Comparison of Fast Blocking Methods for Record Linkage,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington, DC, August 2003. 

Winkler, W. E. (2004c), “Approximate String Comparator Search Strategies for Very Large Administrative Lists,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM.

Here are several papers I have used with some samples of open-source algorithms:
William W. Cohen, Pradeep Ravikumar, Stephen Fienberg, and Kathryn Rivard: a technical paper on string-matching methods that uses SecondString (PDF).

Mikhail Bilenko and Raymond Mooney, University of Texas at Austin; William Cohen, Pradeep Ravikumar, and Stephen Fienberg, Carnegie Mellon University, “Adaptive Name Matching in Information Integration“.

William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Distance Metrics for Name-Matching Tasks”.

How our brains handle matching challenges

Consider this: you are watching a school concert and several dozen children are up on stage. Now, pick out the twins. You would probably start by looking for groups based on hair color, hair length, and so on, long before you start comparing faces. This is, in essence, grouping or blocking. If you line up the blonds on the left and the brunettes on the right, you now have two blocks.
  

The next step in identifying the twins is to repeat the process with new groups until you have found them: compare within the blonds, then the brunettes, and so on; then move on to short hair versus long hair; and finally to similar face shapes. That last step is fuzzy matching.
Hair is blond or brunette, long or short, but a face is a collection of features forming a pattern. Our brains instinctively look for faces that are similar and then compare them more closely. The obvious point is to begin comparing faces only once we have narrowed the field down to a small group of children.
Let’s examine the steps involved as a logical process.
  

1. Cleansing and Standardization

The objective here is to cleanse and format the required columns; the key is to get as much consistency and uniformity in each column as possible. For instance, in columns like Zip Code or Address you want correct values and consistent patterns. This is critical to reducing unnecessary processing time during the match step. One way to automate this step is to use data profiling combined with pattern matching or regular expression (regex) cleansing.
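As a rough illustration of this kind of regex cleansing (a minimal sketch only; the column formats and rules below are assumptions, and in practice this logic would live in your profiling or ETL tool), US ZIP codes and phone numbers could be standardized like this:

import re

def clean_zip(value):
    """Keep the 5-digit US ZIP if one can be recovered, otherwise None."""
    digits = re.sub(r"\D", "", value or "")
    return digits[:5] if len(digits) >= 5 else None

def clean_phone(value):
    """Reduce a phone number to its 10 digits, dropping punctuation and a leading 1."""
    digits = re.sub(r"\D", "", value or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None

print(clean_zip("32801-1234"))        # 32801
print(clean_phone("(407) 555-0199"))  # 4075550199

Consistent, digits-only values like these are exactly what makes the equality matches in the next step cheap and reliable.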
2. Categorize or Group records

Once we have cleansed and standardized our input, we can examine how best to split, group or categorize our records into separate sets. An example would be to put the records for each state (Arkansas, Maine, etc.) into its own group. In record linkage jargon, this is called a Blocking Index.
The thinking here is that we want to use as many columns for equality matches as possible, thereby reducing the number of records that need fuzzy matching. Obviously, if we can match on state, zip and phone, that leaves only the name field as a candidate for a fuzzy match. Grouping sets of records by state, or by the first three characters of the phone number, means we need to process fewer records during the match. As we discussed earlier, a fuzzy match is in effect a cross join and can quickly result in a very high volume of record and column comparisons.
William Winkler, principal researcher at the US Census Bureau, has detailed the groups of column types that tend to be most effective. The best groups, or Blocking Indexes, vary depending on the columns available; Winkler particularly recommends the first five of the following (a small sketch after the list shows how such keys can be built):
ZIP, first three characters of surname.

First ten digits of phone number.

First three characters of last name, first three characters of first name.

ZIP, house number.

First three characters of ZIP, day-of-birth, month-of-birth.

First three characters of last name, first three characters of first name, month-of-birth.

First three characters of surname, first three characters of phone, house number.

First three characters of surname, first three characters of first name, date-of-birth.

First three characters of first name, first three characters of ZIP, house number.

First three characters of last name, first three characters of ZIP, first three characters of phone number.

First three characters of last name, first three characters of first name (2-way switch), first three characters of ZIP, first three characters of phone number.
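As a rough sketch of how such blocking keys can be built (the field names and the Python representation are my own assumptions; in practice this logic sits inside your matching or ETL tool), each record gets one or more short keys, and only records that share a key are ever compared:

def blocking_keys(rec):
    """Build a few Winkler-style blocking keys for one record (a dict of cleansed fields)."""
    keys = []
    if rec.get("zip") and rec.get("last_name"):
        keys.append(("zip+sur3", rec["zip"], rec["last_name"][:3].upper()))
    if rec.get("phone"):
        keys.append(("phone10", rec["phone"][:10]))
    if rec.get("last_name") and rec.get("first_name"):
        keys.append(("sur3+first3", rec["last_name"][:3].upper(), rec["first_name"][:3].upper()))
    return keys

print(blocking_keys({"first_name": "Jon", "last_name": "Smith",
                     "zip": "32801", "phone": "4075550199"}))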

3. Split 

Split records based on the criteria in the Blocking Indexes, creating separate data streams to support parallel match processing.
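A minimal sketch of this split step (assuming records are simple dictionaries and match_block stands in for whatever comparison routine you use; in SSIS the parallelism comes from the pipeline rather than from code like this):

from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def split_into_blocks(records, key_field="state"):
    """Group records into separate streams, one per blocking value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec.get(key_field, "UNKNOWN")].append(rec)
    return blocks

def match_block(block):
    """Placeholder for the fuzzy comparison applied within a single block."""
    return []  # would return scored candidate pairs for this block

def match_all(records):
    """Process each block independently, so blocks can run in parallel."""
    with ProcessPoolExecutor() as pool:
        results = pool.map(match_block, split_into_blocks(records).values())
    return [pair for block_pairs in results for pair in block_pairs]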
4. Compare

Compare by applying a fuzzy matching algorithm within each group of records and determining scores based on the groups selected. We will discuss various algorithms in a future post. The fundamental performance problem with fuzzy matching is that matches are similar rather than exact, so every candidate pair must be scored. If you use a commercial matching tool and start comparing data sets, I guarantee the tool will push you to define as many columns or fields for exact matching (the Blocking Index) as possible before you do any fuzzy comparison.
5. Split 

Split the results into separate categories: matches, non-matches and possible matches.
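A minimal sketch of this categorization, assuming each candidate pair already carries a similarity score between 0 and 1 and that the two thresholds are tuned for your own data:

def categorize(scored_pairs, match_threshold=0.9, possible_threshold=0.7):
    """Split scored pairs into matches, possible matches and non-matches."""
    matches, possible, non_matches = [], [], []
    for pair, score in scored_pairs:
        if score >= match_threshold:
            matches.append(pair)
        elif score >= possible_threshold:
            possible.append(pair)
        else:
            non_matches.append(pair)
    return matches, possible, non_matches

print(categorize([(("A1", "B7"), 0.95), (("A2", "B9"), 0.75), (("A3", "B4"), 0.40)]))
# ([('A1', 'B7')], [('A2', 'B9')], [('A3', 'B4')])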
6. Analyze

Analyze the non-matches and possible matches. Matches also need to be reviewed for accuracy; this can be done with tools or, in some cases, manually.
7. Evaluate

Evaluate the results, manually or with matching tools, to determine whether the best combination of algorithms has been used. Possible matches need to be evaluated and analyzed. Determine whether additional cleansing or different matching algorithms could be used more effectively.
SSIS Example of Fuzzy Matching with Blocking Indexes

In the example below I will demonstrate steps 2-5 using the SSIS Fuzzy Grouping and Conditional Split components, as well as the SSIS pipeline architecture for parallel processing. In SSIS, the basic process for implementing blocking indexes is to use a Conditional Split to create multiple streams of records and apply a Fuzzy Grouping transform to each. For this example we assume a basic customer input with First Name, Last Name, Address, State, Zip and Phone.
We will split the records based on State in the following manner (a rough sketch of the routing logic follows the list):
Block01 = AK, AL, AZ, CA, CO

Block02 = FL, GA, MI, MA, MN

Block03 = Everything else.
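In the package this routing is implemented with Conditional Split expressions; the logic is equivalent to the following sketch (the block membership matches the list above, and everything else falls through to Block03):

BLOCK01 = {"AK", "AL", "AZ", "CA", "CO"}
BLOCK02 = {"FL", "GA", "MI", "MA", "MN"}

def route(record):
    """Return the name of the output path a record would be sent down."""
    state = record.get("State")
    if state in BLOCK01:
        return "Block01"
    if state in BLOCK02:
        return "Block02"
    return "Block03"

print(route({"State": "CA"}))  # Block01
print(route({"State": "TX"}))  # Block03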

Putting the theory to the test

I created a package using AdventureWorks that runs Fuzzy Grouping over 10,000 records without a blocking index; the execution time was 39 minutes.
I then ran the same 10,000 records with a conditional split, creating a blocking index with three parallel threads by blocking on state. Using a blocking index, the execution time was 20 minutes.
In this case the execution time was cut by almost half. When processing larger volumes, the net reduction in time would be much greater. 

The following is a detailed walkthrough; there is no need to refer to it unless you are interested in understanding an actual Microsoft example. Other examples are available; contact me if you are interested.

Microsoft SSIS example
How to do it

Here is the setup for the Fuzzy Grouping transform. This transform has a very straightforward configuration: once you have selected the desired input columns, you select the Match Type, Fuzzy or Exact, for each. You will note that except for First Name and Last Name, all columns are set for an exact match; First Name and Last Name are set to Fuzzy. This setup is the same for both tests. For more detailed information on how to set up the Fuzzy Grouping transform, take a look here.

[Screenshot: Fuzzy Grouping transform setup]

Here is the first package without splitting records via a blocking index.

[Screenshot: the package without a blocking index]
Run this test first. Note the time elapsed in the Execution Status tab.
Now let’s change the package and implement what we have discussed. We will split the paths, leveraging the parallelism and pipeline capabilities in SSIS. Revise the package by adding a Conditional Split and multiple Fuzzy Grouping transforms.

[Screenshot: the revised package with a Conditional Split and parallel Fuzzy Grouping transforms]

Here is the setup for the Conditional Split used to implement the blocking index, where we split our records into three blocks of states. Bear in mind the trade-off: potential matches that span two different blocks will not be compared in this pass, so if cross-state duplicates are a concern they need to be checked separately.
Once you have configured the Conditional Split you need to connect the separate outputs to the original Union All component. If you are not familiar with the Conditional Split check out Andy Leonard’s excellent post here.

[Screenshot: Conditional Split configuration]
One more important point on package performance: there are two properties that can help improve performance for SSIS DataFlows when properly configured. These are DefaultBufferMaxRows and DefaultBufferSize, which are covered well in SQLRUNNER’s blog.
Summary

The major takeaway from this article is that you should apply the same logical thinking to the SSIS Fuzzy Grouping transform that you apply to everyday matching problems.
Secondly, the capability that enables this technique to succeed is the pipeline architecture in SSIS.

On a final note, here is the pipeline architecture as defined by Microsoft. I have included this in my other SQL Server Central articles on fuzzy matching, and I think it is worth taking a few minutes here to review this critical capability at the heart of the SSIS architecture:
“At the core of SSIS is the data transformation pipeline. This pipeline has a buffer-oriented architecture that is extremely fast at manipulating row sets of data once they have been loaded into memory. The approach is to perform all data transformation steps of the ETL process in a single operation without staging data, although specific transformation or operational requirements, or indeed hardware may be a hindrance. Nevertheless, for maximum performance, the architecture avoids staging. Even copying the data in memory is avoided as far as possible. This is in contrast to traditional ETL tools, which often requires staging at almost every step of the warehousing and integration process. The ability to manipulate data without staging extends beyond traditional relational and flat file data and beyond traditional ETL transformation capabilities. With SSIS, all types of data (structured, unstructured, XML, etc.) are converted to a tabular (columns and rows) structure before being loaded into its buffers. Any data operation that can be applied to tabular data can be applied to the data at any step in the data-flow pipeline. This means that a single data-flow pipeline can integrate diverse sources of data and perform arbitrarily complex operations on these data without having to stage the data.
It should also be noted though, that if staging is required for business or operational reasons, SSIS has good support for these implementations as well.

This architecture allows SSIS to be used in a variety of data integration scenarios, ranging from traditional DW-oriented ETL to non-traditional information integration technologies.”
  
Ira Warren Whiteside Blog
“Do, or do not. There is no try.”
“Karo yaa na karo, koshish jaisa kuch nahi hai.” (the same, in Hindi)


The secret path to Data Quality and Business Clarity

The secret to implementing data quality is to follow the path below: a very business-driven and focused approach that is extremely iterative and collaborative.

The software or tools may change, but the logical path, defined by identifying the important business measurements required for successful and measurable results, will not.

The key is not to think of it as some kind of technical proof of concept (POC) or tool trial.

It is important to realize that “what” you want to measure, and thereby understand, will not change; only “how” you create the result will change.

While many organizations are led down the path of creating a Data Governance program, it is frankly too large a task and, more importantly, cannot be adequately planned without first implementing a Data Quality program with analytical capabilities.

For example, in the real world, if you wanted to drill an oil well, then before you plan, budget, move or buy equipment you would survey the land, examine the minerals and drill a test well. This is not the same thing as, in our IT data world, doing a vendor or tool proof of concept (POC) or a pilot to see if the vendor product works.

The oil company knows exactly how its equipment works and the processes it will follow; it is trying to determine “where” to drill, not “how” to drill.

In our world, the IT world, we act as if we need to “somehow” complete a “proof of concept” without really knowing exactly what concept we are proving.

Are we proving that the tool works? That our data has errors, or that our processes are flawed? In essence, are we verifying that if we find bad data we want to fix the processes or the data? None of these concepts needs “proving”.

My point is that proving these low-level concepts is probably worthless to the business, and maybe even destructive, unless they are associated with an actual set of business goals or measurements and linked directly to understandable business deliverables. This is my way of saying: put this information in an organized set of spreadsheets linking the business metrics, the required fields and the order in which you analyze them, and follow a proven process to document them and provide deliverables for both business and technical needs.

When I say linking, I mean creating an “information value chain” relating business goals to business questions and breaking them down (decomposing them) into the following:

  1. Business Goal – Corporate objectives.
  2. Business Question – The questions that must be answered to manage and meet the objectives.
  3. Metric – The specific formulas required (Profit = Revenue – Expenses).
  4. Hierarchies – The order of the fields (attributes) needed to analyze or drill down on the metrics (Product, Department, Time).
  5. Dimensions – Natural groupings of attributes that relate to each other (Customer, Name, Address, etc.).
  6. Business Matrix – A cross-reference or matrix showing the relationships between business questions, business processes, metrics and dimensions, comparing your business model to your data model.

 

The methodology for building the information value chain is as follows:

[Diagram: the methodology for building the information value chain]

Following this approach, as the diagram shows, will yield a data model and application architecture that support answering actual business questions, and will provide the foundation either to continue the path to data governance or simply to hold in place and explore your data to better understand your issues and their impact, and then plan and prioritize your next step.

Follow the path: pick a “real” goal or measurement, preferably one that matters.

After that, follow the path in the diagram.

[Diagram: the path to data quality]


Business Governance from chaos to clarity with Data Quality improvements and Data Quality Mart.

[Diagram: from chaos to clarity with data quality improvements and a Data Quality Mart]


Crap vs Data Governance

[Image: Crap vs Data Governance]


MDS Versions, Flags and Subscriptions

Recently I collaborated with Andrew Milner on a scenario in MDS (SQL Server 2012) to give the client the ability to switch between multiple Versions and Flags via Subscription Views. Vishal Singh did an excellent job of implementing this scenario in MDS.

Versions and Flags

An MDS Model defines the structure of the data, and a Version defines its content. In Master Data Services, multiple versions of the master data can be created within a model. Versions can be created sequentially (one version after another) or simultaneously (multiple versions in parallel). This way we ensure we have a dedicated, consistent version of our master data for downstream consumption while at the same time having another version for adding, deleting or modifying members, or for testing our model data, without compromising the consistency of the master data that is available.

For example, we may want to have one version of the data marked as “current” and another marked as “proposed”, using Version Flags. When the proposed version is confirmed as valid and clean, we can move it from proposed to current. The version marked as “current” will therefore always be the “golden copy” of the data.

When to Use Versions

Use versions to:

  • Prevent users from making changes while all data is validated against business rules.
  • Lock down a model for use by subscribing systems.

Version Flags

When a version is ready for users or for a subscribing system, we can set a flag to identify the version. We can move a flag from version to version as required. Flags help users and downstream applications identify which version of a model to use.

Scenario:

Create a model and three versions (one in Committed status and the other two in Open status).

An initial version is created by default when a model is created. To create two other versions in Open status, commit the default version and create two new versions from the committed version. Name the two versions PROD and TEST and map them to two different flags, PROD and TEST respectively. Data inserted into the two versions is mutually exclusive, meaning the versions are not related to each other: data modified in one version will not be reflected in the other. A subscription view created against a version will contain data for that version only.

For example, I created a sample model called “Employee” with three versions: VERSION_1 (Committed status), MDS (Open status) and SSIS (Open status). In the same way, I created two flags, MDS and SSIS, mapped to the MDS and SSIS versions respectively. The data present in the different versions (MDS and SSIS) is not related and thus not in sync.

In the Employee entity there is one employee who works in two different technologies, MDS and SSIS, as highlighted below. The employee (Employee ID: IW003) works in both SSIS and MDS, but the name differs between the two versions.

[Screenshots: Employee entity data in the MDS and SSIS versions]

The records highlighted above are common to the MDS and SSIS versions of the Employee entity, the only difference being the Name (Employee ID: IW003). Subscription views created on the MDS version (linked to the MDS flag) and the SSIS version (linked to the SSIS flag) will therefore contain different sets of data, as shown below:

[Screenshot: subscription view data for the MDS and SSIS versions]

Downstream applications fetch data through the subscription views they consume. An application consuming the MDS version (mapped to the MDS flag) will therefore see different data from one consuming the SSIS version (mapped to the SSIS flag).

Architecture in MDS Acme

We used the following logic in MDS Acme:

[Diagram: MDS Acme version and flag architecture]

In MDS Acme we have two different data sources, Production and Test. Data coming from the two sources is inserted into the same model (Sample) and the same entities. Two versions, V1 and V2, are created pointing to the Production and Test data sources, and they are mapped to flags F1 and F2 respectively. The subscription views are linked to the version flags, and downstream applications consume data from the published subscription views. When a new version V3 needs to be created as a copy of version V2, the following steps are required:

  1. Validate and commit the old version, V2 in this example.
  2. Copy the old version V2 to create a new version called V3.
  3. Remove the flag F2 from the old version V2.
  4. Map flag F2 to the newly created version V3.

Because the subscription views are created against the version flags and not against the versions themselves, we do not have to change the subscription views for the newly created version. Downstream applications will now fetch data from the new version instead of the old one.
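To make the mechanics concrete, here is a small illustrative sketch (the class and method names are invented for illustration and are not the MDS API): because a subscription view resolves a flag to whichever version currently carries it, moving the flag is all that is needed to repoint downstream consumers.

class Model:
    def __init__(self):
        self.versions = {}   # version name -> member data
        self.flags = {}      # flag name -> version name

    def copy_version(self, source, new_name):
        self.versions[new_name] = dict(self.versions[source])

    def set_flag(self, flag, version):
        self.flags[flag] = version

    def subscription_view(self, flag):
        """A view bound to a flag always reads whichever version the flag points to."""
        return self.versions[self.flags[flag]]

m = Model()
m.versions["V2"] = {"IW003": "John Smith"}
m.set_flag("F2", "V2")
m.copy_version("V2", "V3")        # step 2: create V3 as a copy of V2
m.set_flag("F2", "V3")            # steps 3 and 4: move the flag to V3
print(m.subscription_view("F2"))  # consumers now read V3 with no view changes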


Reference Data Management Implementation: Four Key Considerations You May Be Overlooking

This is a great article discussing Reference Data Management (RDM) and its relation to MDM. It is also important to remember that RDM is an excellent starting point for organically evolving the underlying processes in support of MDM and a Data Governance organization.

InfoTrellis blog

In recent years reference data management (RDM) has slowly crept into the forefront of business decision-makers’ consciousnesses, making its way steadily upwards in priority within corporate goals and initiatives. Organizations are suddenly seeing the benefits of investing in RDM, attention grabbed by potential paybacks like smoother interoperability among various functions of the organization and centralized ownership and accountability in creating trustworthy data.

Before we dive into talking about approaches for implementation, I want to look at the potential significance of RDM in an enterprise. When the market is inclined towards concepts like Data Integration, MDM and Business Intelligence tend to hog the spotlight. For these particular corporate initiatives, the primary focus of data is key business information like customers, products or suppliers. It is equally important, however, to appreciate the fact that reference data plays a major role in organizing and comprehending all these key pieces of business data.

Whenever…



Informatica Cloud MDM for Salesforce (formerly Data Scout) Review

 

Tactically improving data quality and incrementally achieving data governance and metadata management is a natural path, and MDM is at the center of that strategy. See the Gartner Group’s Applying Data Mart and Data Warehousing Concepts to Metadata Management.

 

[Diagram: the metadata management pyramid]

In Metadata Mart: the Road to Data Governance, or Guerilla Data Governance, I outline this approach.

I have just completed a Data Governance assessment and a review of Informatica Cloud MDM (formerly Data Scout for Salesforce) with my colleague, the excellent solution architect Balaji Kharade. The client in this case is interested in implementing Informatica Cloud MDM in Salesforce as a tactical approach to improving data quality and incrementally improving data governance. I would like to acknowledge the incredible insight I gained from Balaji Kharade in this effort.

In general, the product is positioned to provide transactional MDM within Salesforce. We will cover the steps for implementation and some background on fuzzy matching, or de-duplication.

We will walk through the steps for setting up the tool:

  1. Cloud MDM Settings
  2. Cloud MDM Profile
  3. Adding Cloud related Information to Page Layout
  4. Synchronization Settings
  5. Data Cleansing
  6. Fuzzy Matching and Segments
  7. External Data Sources
  8. Consolidation and Enrichment
  9. Limitations

 

This post assumes familiarity with the Salesforce architecture.

 

1. Cloud MDM Settings

  • The Cloud MDM master on/off switch is configured here, along with other settings such as extracting the legal form and domain, overriding Salesforce Account information with the master bean after match and merge in Cloud MDM, and standardizing Country.
  • In some cases, you may wish to turn off Cloud MDM after you have installed and configured it. For example, you may wish to bring in a new set of data without creating beans; to achieve this, you need to switch Cloud MDM off.

[Screenshot: Cloud MDM Settings]

2. Cloud MDM Profile:

  • When Cloud MDM is installed, a default profile is given to all users. For your users to get access to all the features of Cloud MDM, you must configure an admin or super-user profile. When you implement Cloud MDM, you can use profiles to assign MDM functionality to Salesforce user profiles.
  • Users can have permissions to create, update, merge and consolidate Accounts, Contacts and Leads; create or ignore duplicate Accounts, Contacts and Leads; view consolidated information; and create or edit hierarchy information.

[Screenshot: Cloud MDM Profile settings]

3. Add Cloud Related Information to Page Layout:

This step adds the MDM-related components to the page layout, such as the consolidated view, Find Duplicates, MDM-related fields like Synchronize, Legal Form and ISO Country, the duplicate Account section, related beans and master beans.

[Screenshots: Cloud MDM components added to the page layout]

4. Synchronization Settings:

  • This setting helps in synchronizing and mapping the Salesforce attributes to the Cloud MDM staging area.
  • We can map the standard fields and up to 10 custom fields. These standard and custom fields are used to configure the segment settings and match strategy in Cloud MDM.
  • The sync job creates the beans and master beans in the Cloud MDM staging area.

[Screenshot: Account synchronization settings]

5. Data Cleansing:

Data cleansing ensures the data is in a consistent format. Consistent data improves the quality of reporting and also improves matching results and the accuracy of duplicate detection.

Legal Form:

Legal form normalization is the process of extracting the legal form from the company norm and populating the legal form field with normalized data.

For example, we can configure the legal form field to contain normalized data for business entity designations such as Limited, Ltd., and L.T.D. We can add legal forms to the list already available after profiling our data set.

Domain Normalization:

We can enable Cloud MDM to populate the domain field with a domain extracted from the website field. Cloud MDM uses the domain field during fuzzy matching.

For example, if a user enters http://www.acme.com/products or www.acme.com in the website field, Cloud MDM can populate the domain field with acme.com. A normalized domain ensures domain field consistency and improves match results.
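A rough sketch of that kind of domain normalization (not Cloud MDM's actual implementation; the stripping rules here are assumptions):

from urllib.parse import urlparse

def extract_domain(website):
    """Pull a bare domain such as acme.com out of a free-text website field."""
    if not website:
        return None
    value = website.strip().lower()
    if "://" not in value:
        value = "http://" + value          # let urlparse see a host
    host = urlparse(value).netloc.split(":")[0]
    return host[4:] if host.startswith("www.") else host

print(extract_domain("http://www.acme.com/products"))  # acme.com
print(extract_domain("www.acme.com"))                  # acme.com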

[Screenshot: data cleansing settings]

6. Fuzzy Matching and Segments:

Segment:

The segment field in the master bean record contains a matching segment. The matching segment is a string of characters that Cloud MDM uses to filter records before it performs fuzzy matching.

To improve fuzzy match performance, Cloud MDM performs an exact match on the matching segments to eliminate records that are unlikely to match, and then performs fuzzy matching on the remaining records. This is essentially the same idea as creating categories and groups in advanced fuzzy matching, or “Blocking Indexes” in record linkage. The segment is created for all Accounts once the sync between Salesforce and Cloud MDM is done, and it is also generated for external beans.

Fuzzy matching:

Fuzzy matching can match strings that are not exactly the same but have similar characteristics and similar patterns.

One example of a fuzzy matching algorithm is Levenshtein, the original fuzzy algorithm: Levenshtein distance, or edit distance, invented in 1965.

Levenshtein:

Counts the number of incorrect characters, insertions and deletions.

Returns:

(maxLen – mistakes) / maxLen

Levenshtein is a good algorithm for catching keyboarding errors.
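A minimal sketch of that score (a standard dynamic-programming edit distance, normalized exactly as in the formula above; this is an illustration, not the vendor's implementation):

def levenshtein(a, b):
    """Count the single-character edits (insert, delete, substitute) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

def similarity(a, b):
    """(maxLen - mistakes) / maxLen, as described above."""
    max_len = max(len(a), len(b)) or 1
    return (max_len - levenshtein(a, b)) / max_len

print(round(similarity("Whiteside", "Whitesdie"), 2))  # 0.78, a typical keyboarding error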

[Screenshot: fuzzy matching settings]

Matching is a two-step process that determines a match score between two records. First, Cloud MDM performs an exact match on the matching segments to exclude records that are unlikely to have matches. Then, Cloud MDM performs a fuzzy match on the remaining records to calculate a match score between pairs of records. If the match score of two records reaches the match score threshold, Cloud MDM considers the two records a match.
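Putting the two steps together, a hedged sketch of the flow (the segment construction, the threshold and the use of Python's difflib as the fuzzy scorer are all assumptions for illustration, not the product's defaults):

from difflib import SequenceMatcher

def segment(rec):
    """A simple matching segment: first two letters of the name plus the country."""
    return (rec.get("name", "")[:2].upper(), rec.get("country", "").upper())

def match_pairs(records, threshold=0.8):
    """Exact match on segments first, then fuzzy-score only the surviving pairs."""
    pairs = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if segment(a) != segment(b):          # step 1: cheap exact filter
                continue
            score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if score >= threshold:                # step 2: fuzzy score against the threshold
                pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs

accounts = [
    {"id": 1, "name": "Acme Corp",  "country": "US"},
    {"id": 2, "name": "Acme Corp.", "country": "US"},
    {"id": 3, "name": "Apex Ltd",   "country": "US"},
]
print(match_pairs(accounts))  # [(1, 2, 0.95)]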

[Screenshots: segment settings]

7. External Data Source:

We may have external data in a system such as SAP or Oracle EBS. We wish to load this data directly into beans, so that we can take some of the information (e.g. SIC Code, Number of Employees) from the SAP record and retain some information (e.g. Company Name) from the Salesforce record.

This setting allows us to configure the external data source and define the trust/priority score, i.e. which value will win over the other during the consolidation and enrichment process.

 

[Screenshot: external data source settings]

8. Consolidation and Enrichment.

Consolidation :

The consolidated view allows us to look at all beans associated with a master bean. In order to use this view, we must configure the fields that will display in the list of associated beans, as well as the account address information. This is done by configuring field sets.

Enrichment:

This setting allows us to overwrite values on the Salesforce org Account with values from the master bean, based on the trust/priority score provided during the configuration of the external data source. We can use the override account option in the Cloud MDM settings to prevent the automatic override of the Salesforce org Account.

[Screenshot: consolidation and enrichment settings]

 

9. Limitations.

 

There are two primary limitations:

 

  1. Custom Fields are limited to 10 and only 6 can be used in syncing.
  2. High-volume matching from an external source is completed in a “pre-match” process, which basically means accessing the “master bean” externally and developing an ETL process with another tool.

 
