Tag Archives: data quality

Chocolate cake, MDM, data quality, machine learning and creating the information value chain’

The primary take away from this article will be that you don’t start your Machine Learning project, MDM , Data Quality or Analytical project with “data” analysis, you start with the end in mind, the business objective in mind. We don’t need to analyze data to know what it is, it’s like oil or water or sand or flour.

Unless we have a business purpose to use these things, we don’t need to analyze them to know what they are. Then because they are only ingredients to whatever we’re trying to make. And what makes them important is to what degree they are part of the recipe , how they are associated

Business Objective: Make Desert

Business Questions: The consensus is Chocolate Cake , how do we make it?

Business Metrics: Baked Chocolate Cake

Metric Decomposition: What are the ingredients and portions?

2/3 cup butter, softened

1-2/3 cups sugar

3 large eggs

2 cups all-purpose flour

2/3 cup baking cocoa

1-1/4 teaspoons baking soda

1 teaspoon salt

1-1/3 cups milk

Confectioners’ sugar or favorite frosting

So here is the point you don’t start to figure out what you’re going to have for dessert by analyzing the quality of the ingredients. It’s not important until you put them in the context of what you’re making and how they relate in essence, or how the ingredients are linked or they are chained together.

In relation to my example of desert and a chocolate cake, an example could be, that you only have one cup of sugar, the eggs could’ve set out on the counter all day, the flour could be coconut flour , etc. etc. you make your judgment on whether not to make the cake on the basis of analyzing all the ingredients in the context of what you want to, which is a chocolate cake made with possibly warm eggs, cocunut flour and only one cup of sugar.

Again belaboring this you don’t start you project by looking at a single entity column or piece of data, until you know what you’re going to use it for in the context of meeting your business objectives.

Applying this to the area of machine learning, data quality and/or MDM lets take an example as follows:

Business Objective: Determine Operating Income

Business Questions: How much do we make, what does it cost us.

Business. Metrics: Operating income = gross income – operating expenses – depreciation – amortization.

Metric Decomposition: What do I need to determine a Operating income?

Gross Income = Sales Amount from Sales Table, Product, Address

Operating Expense = Cost from Expense Table, Department, Vendor

Etc…

Dimensions to Analyze for quality.

Product

Address

Department

Vendor

You may think these are the ingredients for our chocolate cake in regards to business and operating income however we’re missing one key component, the portions or relationship, in business, this would mean the association,hierarchy or drill path that the business will follow when asking a question such as why is our operating income low?

For instance the CEO might first ask what area of the country are we making the least amount of money?

After that the CEO may ask well in that part of the country, what product is making the least amount of money and who manages it, what about the parts suppliers?

Product => Address => Department => Vendor

Product => Department => Vendor => Address

Many times these hierarchies, drill downs, associations or relationships are based on various legal transaction of related data elements the company requires either between their customers and or vendors.

The point here is we need to know the relationships , dependencies and associations that are required for each business legal transaction we’re going to have to build in order to link these elements directly to the metrics that are required for determining operating income, and subsequently answering questions about it.

No matter the project, whether we are preparing for developing a machine learning model, building an MDM application or providing an analytical application if we cannot provide these elements and their associations to a metric , we will not have answered the key business questions and will most likely fail.

The need to resolve the relationships is what drives the need for data quality which is really a way of understanding what you need to do to standardize your data. Because the only way to create the relationships is with standards and mappings between entities.

The key is mastering and linking relationships or associations required for answering business questions, it is certainly not just mastering “data” with out context.

We need MASTER DATA RELATIONSHIP MANAGEMENT

not

MASTER DATA MANAGEMENT.

So final thoughts are the key to making the chocolate cake is understanding the relationships and the relative importance of the data/ingredients to each other not the individual quality of each ingredient.

This also affects the workflow, Many inexperienced MDM Data architects do not realize that these associations form the basis for the fact tables in the analytical area. These associations will be the primary path(work flow) the data stewards will follow in performing maintenance , the stewards will be guided based on these associations to maintain the surrounding dimensions/master entities. Unfortunately instead some architects will focus on the technology and not the business. Virtually all MDM tools are model driven APIs and rely on these relationships(hierarchies) to generate work flow and maintenance screen generation. Many inexperienced architects focus on MVP(Minimal Viable Product), or technical short term deliverable and are quickly called to task due to the fact the incurred cost for the business is not lowered as well as the final product(Chocolate Cake) is delayed and will now cost more.

Unless the specifics of questionable quality in a specific entity or table or understood in the context of the greater business question and association it cannot be excluded are included.

An excellent resource for understanding this context can we found by following: John Owens

Final , final thoughts, there is an emphasis on creating the MVP(Minimal Viable Product) in projects today, my take is in the real world you need to deliver the chocolate cake, simply delivering the cake with no frosting will not do,in reality the client wants to “have their cake and eat it too”.

Note:

Operating Income is a synonym for earnings before interest and taxes (EBIT) and is also referred to as “operating profit” or “recurring profit.” Operating income is calculated as: Operating income = gross incomeoperating expenses – depreciation – amortization.

The secret path to Data Quality and Busines Clarity

The secret to implementation of data quality is to follow the path below, mainly very Business Driven and focused approach extremely iterative and collaborative

The software or tools may change but the logical path, defined by identifying important business measurements required for successful and measurable results will not.

The key is not think of it as some kind of technical POC or tool trial.

It is important to realize “What” you want to measure and there by understand will not change, only “How” you create the result will change.

While many organizations are led down the path of creating a Data Governance Program, it’s frankly to large of a task, and more importantly cannot adequately be planned, with first implementing a Data Quality program, with analytical capabilities.

For example in the real world if you wanted to drill an oil well, first and before you plan, budget, move or buy equipment you would,  survey the land, examine the  minerals and drill a test well. This is not the same as in our IT data world as doing a vendor or tool Proof of Concept(POC) or a pilot to see if the vendor product works.

The oil company know exactly how there equipment works and the processes they will follow, they are trying to determine “where” to drill , not “how” to drill .

In our world , the IT WORLD, we act as if we need to “somehow” complete a “proof of concept” without really know exactly what concept we’re proving.

Are we proving the tool works, are we proving our data has errors or  our processes are flawed in essence we verifying that if we find bad data we want to fix them or the data, none of these concepts need “proving”.

My point is proving these low level concepts is probably worthless to the business and maybe even destructive, unless they are associated with a actual set of Business Goals or Measurements and they are linked directly with understandable Business deliverables. This is my way of saying put this information in an organized set of  spreadsheets linking business metrics, required fields and the order you analyze them and follow a proven process to document them and provide deliverables for both the business and technical needs.

When I say linking I mean creating an “information value chain” relating Business Goals :to Business Questions and breaking them down(decomposing)  them into the following:

  1. Business Goal – Corporate Objectives
  2. Business Question – Question needed for managing meeting the objectives.
  3. Metric – Specific formulas required. (Profit= Revenue – Expenses)
  4. Hierarchies – The order to the fields(attributes) necessary to analyze or drill down on the metrics.(Product, Department, Time)
  5. Dimension s  Natural grouping of attributes relating to each other.(Customer, Name, Address etc…)
  6. Business Matrix – Cross reference or matrix showing relationships between Business Questions, Business Processes, Metric and Dimensions. Comparing you business model to your data model.

 

The methodology for building the information value chain is as follows:

 

metric

 

 

Following this approach as the diagram shows will yield a data model and application architecture that will support answering actual business questions and provide the foundation to continue the path to data governance or to simply hold in place and explore you data to better understand your issues, their impact and then plan and prioritize your next step

Follow the path, pick a “real” goal or measurement , preferably one that matters

After that follow the path in the diagram

 

path