Tag Archives: MDM

Chocolate cake, MDM, data quality, machine learning and creating the information value chain’

The primary take away from this article will be that you don’t start your Machine Learning project, MDM , Data Quality or Analytical project with “data” analysis, you start with the end in mind, the business objective in mind. We don’t need to analyze data to know what it is, it’s like oil or water or sand or flour.

Unless we have a business purpose to use these things, we don’t need to analyze them to know what they are. Then because they are only ingredients to whatever we’re trying to make. And what makes them important is to what degree they are part of the recipe , how they are associated

Business Objective: Make Desert

Business Questions: The consensus is Chocolate Cake , how do we make it?

Business Metrics: Baked Chocolate Cake

Metric Decomposition: What are the ingredients and portions?

2/3 cup butter, softened

1-2/3 cups sugar

3 large eggs

2 cups all-purpose flour

2/3 cup baking cocoa

1-1/4 teaspoons baking soda

1 teaspoon salt

1-1/3 cups milk

Confectioners’ sugar or favorite frosting

So here is the point you don’t start to figure out what you’re going to have for dessert by analyzing the quality of the ingredients. It’s not important until you put them in the context of what you’re making and how they relate in essence, or how the ingredients are linked or they are chained together.

In relation to my example of desert and a chocolate cake, an example could be, that you only have one cup of sugar, the eggs could’ve set out on the counter all day, the flour could be coconut flour , etc. etc. you make your judgment on whether not to make the cake on the basis of analyzing all the ingredients in the context of what you want to, which is a chocolate cake made with possibly warm eggs, cocunut flour and only one cup of sugar.

Again belaboring this you don’t start you project by looking at a single entity column or piece of data, until you know what you’re going to use it for in the context of meeting your business objectives.

Applying this to the area of machine learning, data quality and/or MDM lets take an example as follows:

Business Objective: Determine Operating Income

Business Questions: How much do we make, what does it cost us.

Business. Metrics: Operating income = gross income – operating expenses – depreciation – amortization.

Metric Decomposition: What do I need to determine a Operating income?

Gross Income = Sales Amount from Sales Table, Product, Address

Operating Expense = Cost from Expense Table, Department, Vendor

Etc…

Dimensions to Analyze for quality.

Product

Address

Department

Vendor

You may think these are the ingredients for our chocolate cake in regards to business and operating income however we’re missing one key component, the portions or relationship, in business, this would mean the association,hierarchy or drill path that the business will follow when asking a question such as why is our operating income low?

For instance the CEO might first ask what area of the country are we making the least amount of money?

After that the CEO may ask well in that part of the country, what product is making the least amount of money and who manages it, what about the parts suppliers?

Product => Address => Department => Vendor

Product => Department => Vendor => Address

Many times these hierarchies, drill downs, associations or relationships are based on various legal transaction of related data elements the company requires either between their customers and or vendors.

The point here is we need to know the relationships , dependencies and associations that are required for each business legal transaction we’re going to have to build in order to link these elements directly to the metrics that are required for determining operating income, and subsequently answering questions about it.

No matter the project, whether we are preparing for developing a machine learning model, building an MDM application or providing an analytical application if we cannot provide these elements and their associations to a metric , we will not have answered the key business questions and will most likely fail.

The need to resolve the relationships is what drives the need for data quality which is really a way of understanding what you need to do to standardize your data. Because the only way to create the relationships is with standards and mappings between entities.

The key is mastering and linking relationships or associations required for answering business questions, it is certainly not just mastering “data” with out context.

We need MASTER DATA RELATIONSHIP MANAGEMENT

not

MASTER DATA MANAGEMENT.

So final thoughts are the key to making the chocolate cake is understanding the relationships and the relative importance of the data/ingredients to each other not the individual quality of each ingredient.

This also affects the workflow, Many inexperienced MDM Data architects do not realize that these associations form the basis for the fact tables in the analytical area. These associations will be the primary path(work flow) the data stewards will follow in performing maintenance , the stewards will be guided based on these associations to maintain the surrounding dimensions/master entities. Unfortunately instead some architects will focus on the technology and not the business. Virtually all MDM tools are model driven APIs and rely on these relationships(hierarchies) to generate work flow and maintenance screen generation. Many inexperienced architects focus on MVP(Minimal Viable Product), or technical short term deliverable and are quickly called to task due to the fact the incurred cost for the business is not lowered as well as the final product(Chocolate Cake) is delayed and will now cost more.

Unless the specifics of questionable quality in a specific entity or table or understood in the context of the greater business question and association it cannot be excluded are included.

An excellent resource for understanding this context can we found by following: John Owens

Final , final thoughts, there is an emphasis on creating the MVP(Minimal Viable Product) in projects today, my take is in the real world you need to deliver the chocolate cake, simply delivering the cake with no frosting will not do,in reality the client wants to “have their cake and eat it too”.

Note:

Operating Income is a synonym for earnings before interest and taxes (EBIT) and is also referred to as “operating profit” or “recurring profit.” Operating income is calculated as: Operating income = gross incomeoperating expenses – depreciation – amortization.

MDS Versions, Flags and Subscriptions

Recently I collaborated with Andrew Milner in a scenario in MDS SQL 2012 to provide the client the ability to with between multiple Versions and Flags via Subscription Views. Vishal Singh did an excellent job in implementing this scenario in MDS.

Versions and Flags

MDS Model defines the structure of the Data and Version defines the content of the data. In Master Data Services, multiple versions of the master data can be created within a model. Versions created in model can be in sequence fashion (one version after another) or in simultaneous fashion (multiple versions in parallel). This way we ensure we have a dedicated/consistent version of our master data for downstream consumption while at the same time having another version for adding/deleting/modifying members or for testing our model data, without compromising consistency in master data availability.

For example: We may want to have one version of the data marked as “current”, and another set marked as “proposed” using Version Flags. When the proposed version is confirmed as valid and clean, we can move that from proposed to current. So the version marked as “current” will always be the “Golden Copy of the Data”

When to Use Versions

Use versions to:

  • Prevent users from making changes while all data validates successfully against business rules.
  • Lock down a model for use by subscribing systems.

Version Flags

When a version is ready for users or for a subscribing system, we can set a flag to identify the version. We can move flag from version to version as required. Flags help user and downstream applications identify which version of a model to use.

Scenario:

Create Model and 3 Versions (one in Commit Status and other two in Open Status).

An initial version is created by default when a model is created. In order to create two other versions in Open status, commit default version and create two versions from the committed version. Name two versions as PROD and TEST and map it to two different Flags PROD and TEST respectively. Data inserted in two different versions are mutually exclusive meaning they are not related to each other. Data modified in one of the version will not be reflected in another version. Subscription views created based on Version will also have data related to that Version only. Flags help user and downstream applications identify which version of a model to use.

For example: Created a sample Model called “Employee” and created three versions VERSION_1 (committed status), MDS (open status) and SSIS (open status). In the same way created two flags MDS and SSIS mapped to MDS and SSIS versions respectively. Data present in different versions (MDS and SSIS) are not related and thus not in synch.

In Employee entity there is one employee which works in two different technologies MDS and SSIS as highlighted below. Employee (Employee ID: IW003) works in both SSIS and MDS technology but there is a difference in their name in different versions.

mds1

mds2

Records highlighted above are common in MDS and SSIS versions of Employee Entity with only difference in their Name (EmployeeID: IW003) . Subscription view created on the MDS (linked to MDS flag) and SSIS (linked to SSIS flag) versions will have different set of data. Please find below the data present in subscription-view:

mds2sv

Downstream application will fetch data based on the subscription views which they are consuming. If downstream application is consuming MDS version (mapped to MDS flag) will have different data and one consuming SSIS version (mapped to SSIS flag) will have different of data.

Architecture in MDS Acme

We have used below logic in MDS Acme:

mds3arch

In MDS Acme we have two different data sources Production and Test. Data coming from different data sources are getting inserted in the same model (Sample) and same entities. Two versions V1 and V2 are created pointing to Production and Test data sources. Versions V1 and V2 are mapped to Flags F1 and F2 respectively. Subscription views created are linked to Version flags. Downstream applications are consuming data based on the subscription view published. When a new Version needs to be created called V3 which is copy of version V2 below steps need to be done:

  1. Validate and commit old version V2 as in this example.
  2. Copy the old version and create a new version called V3 from V2
  3. Remove the flag F1 from the old version V2
  4. Map flag F1 to newly created version V3

As the subscription views are created based on the Version Flags and not on Version. So we don’t have to change the subscription views for the newly created version. Downstream application will now be able to fetch data from the new versions instead of old versions.

Informatica Cloud MDM for Salesforce (formerly Data Scout) Review

 

Tactically improving Data Quality and incrementally achieving Data Governance and metadata management  is a natural path and MDM  is the center of that strategy. See Gartner Group’s Applying Data Mart and Data Warehousing Concepts to Metadata Management

 

pyramid

In Metadata Mart the Road to Data Governance or Guerilla Data Governance I outline this approach

I’ve just completed a Data Governance Assessment and review of Informatica CLOUD MDM(formerly Data Scout for Salesforce) with my colleague and excellent Solution Architect Baliji Kkarade .  The client in this case is interested in implementing Informatica CLOUD MDM in Salesforce , as a tactical approach to improving Data Quality and incrementally improving Data Governance . I’d like to aknowledge the incredible insight I gained from Balaji Kharade in this effort.

In general and product is positioned to provide a transactional MDM within Salesforce. We will cover the steps for implementation and some back ground on Fuzzy Matching or de-duplication.

We will walk thru the steps for setting up the tool.

  1. Cloud MDM Settings
  2. Cloud MDM Profile
  3. Adding Cloud related Information to Page Layout
  4. Synchronization Settings
  5. Data Cleansing
  6. Fuzzy Matching and Segments
  7. External Data Sources
  8. Consolidation and Enrichment
  9. Limitations

 

This post assumes familiarity with the Saleforce architecture.

 

1. Cloud MDM Settings

  • Cloud MDM master on/off switch is configured using this setting and other settings like extracting the legal form and domain , overriding Salesforce Account information using Master bean after Match and Merge in Cloud MDM, and Standardizing Country.
  • In some cases, you may wish to turn off Cloud MDM after you have installed and configured it.
  • For example, if you wish to bring in a new set of data without creating beans. To achieve this, you need to switch Cloud MDM off.

mdmsettings

2. Cloud MDM Profile:

  • When Cloud MDM is installed, a default profile is given to all users. In order for your user to get access to all the features of Cloud MDM, you must configure an admin or super user profile. When you implement Cloud MDM, you can use profiles to assign MDM functionality to Salesforce user profiles.
  • Users can have Permissions to Create/Update/Merge/consolidate Account, Contact and Leads, Create/Ignore duplicate Account, Contact and Leads, View consolidated information and create/Edit Hierarchy information

mdmsettings2

3. Add Cloud Related Information to Page Layout:

Helps to add MDM related components like Consolidated view, Find duplicates, MDM related fields like Synchronize, Legal forms, ISO country ,duplicate Account section, Related beans and Master Beans etc .

mdmpageinfos

mdmpageinfos2

4. Synchronization Settings:

  • This setting helps in Synchronizing/Mapping the Salesforce Attributes to Cloud MDM stage Area.
  • We can map Standard fields and 10 custom fields. These standard and custom fields help us in configuring segment settings and match strategy in cloud MDM.
  • Sync job helps creating beans and Master beans in cloud MDM stage Area.

mdmaccountsync

5. Data Cleansing:

Data cleansing ensures the data is in a consistent format. Consistent data improves the quality of reporting and also improves matching results and the accuracy of duplicate detection.

Legal Form :

Legal form normalization is the process of extracting the legal form from the company norm and populating the legal form field with normalized data.

For example, We can configure the legal form field to contain normalized data for business entity designations such as Limited, Ltd., and L.T.D. We can add legal forms to the list available already after profoiling our data set.

Domain Normalization :

We can enable Cloud MDM to populate the domain field with a domain extracted from the website field. Cloud MDM uses the domain field during fuzzy matching.

For example, if a user enters http://www.acme.com/products or www. acme.com in the website field, Cloud MDM can populate the domain field with acme.com. normalized domain ensures domain field consistency and improves match results.

mdmdsatacleansing

6. Fuzzy Matching and Segments:

Segment :

The segment field in the master bean record contains a matching segment. The matching segment is a string of characters that Cloud MDM uses to filter records before it performs fuzzy matching.

To improve fuzzy match performance, Cloud MDM performs an exact match on the matching segments to eliminate records that are unlikely to match. Cloud MDM then performs fuzzy matching on the remaining records. This can be basically creating Categories and Groups in Advanced Fuzzy Matching or “Blocking Indexes” Record Linkage. This will be created for all the Accounts once the Sync between Salesforce and MDM is done. It is also generated for external beans.

Fuzzy matching :

Fuzzy matching can match strings that are not exactly the same but have similar characteristics and similar patterns.

One example of a Fuzzy Matching algorithm is LevenShtein, the original Fuzzy algorithm Levenshtein  Distance or Edit Distinceinvented in 1965

Levenshtein:

Counts the number of incorrect characters, insertions and deletions.

Returns:

(maxLen – mistakes) / maxLen

Levenshtein is a good algorithm for catching keyboarding errors

mdmFuzzy1

Matching is a two step process that determines a match score between two records. First, Cloud MDM performs an exact match on the matching segments to exclude records that are unlikely to have matches. Then, Cloud MDM performs a fuzzy match on the remaining records to calculate a match score between pairs of records. If the match score of the two records achieves the match score threshold, Cloud MDM considers the two records a match.

mdmsegment settingsmdmpageinfos2 mdmpageinfos

7. External Data Source:

We have external data in a system, such as SAP or Oracle EBS. We wish to load this data directly into beans, so we can take some of the information (SIC Code, No of Employees.) from the SAP record, and retain some information (e.g. Company Name) from the Salesforce record.

This setting allows us to configure the external Data source and defining the trust/priority score i.e. which value will win over the other during Consolidation and Enrichment process.

 

7ext

8. Consolidation and Enrichment.

Consolidation :

The consolidated view allows us to look at all beans associated with a master bean. In order to use this view, we must configure the fields that will display in the list of associated beans, as well as the account address information. This is done by configuring field sets.

Enrichment:

This setting allows us to over write the value from the Master bean to the Salesforce Org Account based on the trust/priority score provided during the configuration of the external Data source. We can use the override account option in cloud MDM settings to prevent the automatic override of the Salesforce Org Account accordingly.

8ent

 

9. Limitations.

 

There are two primary limitations:

 

  1. Custom Fields are limited to 10 and only 6 can be used in syncing.
  2. High volume matching from External Source is completed in a “Pre Match” process, which is basically accessing the “Master Bean” externally and developing ETL Process with another tool.