DNA and the concept of MDM (Master Data Management), or modern ML/AI data preparation, have many similarities.
“There is a subtle difference between data and information.”
We in IT have complicated and diluted the concept and process of analyzing data and business metrics incredibly in the last few decades. We seem to be fixated on the word “data.”
Yet the primary business objective of MDM is to provide consistent answers based on standard business definitions, together with an understanding of the relationship, or mapping, of business outcomes to data elements.
DNA vs MDM or IT’s version of DNA.
The graphic I’ve chosen for this post symbolizes the linkage, or lineage, of human beings to DNA. What I’d like to do is relate the importance of data lineage to examples of how human DNA communicates lineage, and then discuss lineage in IT as it relates to business functionality.
Living organisms are very complex, as is a company and its data or information.
The genetic information of every living organism is stored inside nucleic acids, the basic data.
There are two types of nucleic acids (NA), namely:
DNA – Defines traits and characteristics
RNA – Communication, information transfer and synthesis
Let’s examine them.
DNA defines traits and characteristics.
RNA handles communication: it can move around in the cells of living organisms and serves as a genetic messenger, passing the information stored in the cell’s DNA from the nucleus to other parts of the cell for protein synthesis.
So here goes. This is a bit of a stretch, but consider DNA to be the “Data” and a person to be the “Information,” which is created from the communication carried out by RNA.
To continue the analogy, DNA or chromosomes in and of themselves are out of context. It’s only once they have been passed from one person to the next, driven by RNA, and result in a human being that they become contextually realized as a human.
Again, if we break down human DNA and inspect it, we can tell many things: the origin or ancestry of the person, their possible traits, and their possible diseases. But the DNA needs to be processed for us to understand the actual person.
So my point, in relation to the various approaches to traditional MDM, is that this (DNA) is how life is created. It’s science, not a methodology, a product, or a vendor’s guesswork. It’s real, and the main point is that “lineage” is the key.
“There is a subtle difference between data and information. Data are the facts or details from which information is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context.”
Examples of Data and Information as it relates to MDM
- The history of temperature readings all over the world for the past 100 years is data.
- If this data is organized and analyzed to find that global temperature is rising, then that is information.
- The number of visitors to a website by country is an example of data.
- Finding out that traffic from the U.S. is increasing while that from Australia is decreasing is meaningful information.
- Often data is required to back up a claim or conclusion (information) derived or deduced from it.
- For example, before a drug is approved by the FDA, the manufacturer must conduct clinical trials and present a lot of data to demonstrate that the drug is safe.
Because data needs to be interpreted and analyzed, it is quite possible, indeed very probable, that it will be interpreted incorrectly.
When this leads to erroneous conclusions, it is said that the data are misleading. Often this is the result of incomplete data or a lack of context.
For example, your investment in a mutual fund may be up by 5% and you may conclude that the fund managers are doing a great job. However, this could be misleading if the major stock market indices are up by 12%. In this case, the fund has underperformed the market significantly.
“Synthesis: the combining of the constituent elements of separate material or abstract entities into a single or unified entity (opposed to analysis, the separating of any material or abstract entity into its constituent elements).”
Synthesis in MDM
Communication, information transfer and synthesis: like RNA, several companies are prevalent in data movement and replication, in essence data logistics.
Defining traits and characteristics: like DNA, many companies have developed and refined the products and techniques required for this. They are data profiling, domain pattern profiling and record linkage, the basis of transforming data into information.
However, lineage is the key. Without it to serve as the connection between data and information, there is in essence no information. And in this case the “information” is the business term from the business glossary.
Integrating data movement capabilities with data profiling capabilities can give a business the ability to answer business questions with certitude, through the transparency of lineage.
To translate this into “business terms,” it’s very similar to providing an audit trail: a business can ask a question such as which customers are the most profitable, look at those customers, and then drill into which products and which areas those profits are coming from.
What lineage accomplishes is to lay a trail of cookie crumbs, for data movement but also for your business questions. It simply makes sure that you can connect the dots as data gets moved, translated, standardized and/or cleansed throughout your enterprise.
And the simple action of linking file metadata (files, columns, profiling results) to a business glossary or business terms will result in deeply informative business insight and analysis.
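To make the linking idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the column names, the `DERIVED_FROM` edges, the `GLOSSARY` entry and both function names are hypothetical and do not come from any product. It shows a business term being resolved to a column and then traced back through each hop to the raw files.

```python
# Hypothetical lineage edges: each column maps to the columns it was derived from.
DERIVED_FROM = {
    "mart.customer_profit.profit": ["staging.orders.revenue", "staging.orders.cost"],
    "staging.orders.revenue": ["raw.sales_file.amt"],
    "staging.orders.cost": ["raw.cost_file.unit_cost"],
}

# Hypothetical business glossary: business term -> the column that answers it.
GLOSSARY = {"Customer Profitability": "mart.customer_profit.profit"}


def trace_lineage(column, edges):
    """Return every upstream column reachable from `column`, depth-first."""
    seen = []
    pending = list(edges.get(column, []))
    while pending:
        current = pending.pop(0)
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.append(current)
        # Visit this column's own sources before the remaining siblings.
        pending = edges.get(current, []) + pending
    return seen


def explain(term):
    """Resolve a business term to its column, followed by its full lineage."""
    column = GLOSSARY[term]
    return [column] + trace_lineage(column, DERIVED_FROM)


for step in explain("Customer Profitability"):
    print(step)
```

The point of the sketch is the direction of travel: the question starts at the business term, and the lineage edges are what let you walk from that term down to the raw data elements.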
“Analysis: the separating of any material or abstract entity into its constituent elements.”
In order for a business manager to analyze, you need to be able to start the analysis at an understandable business terminology level, and then provide the manager with the ability to decompose or understand lineage from a logical perspective.
There are three essential capabilities required for analysis and for utilizing lineage to answer business questions via a metadata mart, and these are very similar to the patterns that exist in DNA.
1. Data profiling and domain (column) analysis, as well as fuzzy matching processes, which are available in many forms:
a. Scan all the values within each column and provide statistics (counts) such as minimum value, maximum value, mode (most frequently occurring value), number of missing values, etc.
b. Frequency or column value patterns – Determine the counts of distinct values within a column, and also identify the distinct patterns occurring across all values within a single column and their associated counts, e.g. (SSN = 123-45-6789, SSN PATTERN = ‘999-99-9999’).
c. Fuzzy matching (similarity algorithms) – This capability enables you to find the counts of duplicate or similar text values.
2. The results need to be stored in a “metadata mart” in order to see the patterns, results and associations, providing lineage and relating raw data to business terms and hierarchies.
3. Visualization and analysis capability to allow for analysis and drill-down into the data mart’s aggregated and statistical contents and the associated business hierarchies and business terminology.
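The column statistics and value patterns in items 1a and 1b can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `value_pattern` and `profile_column` are hypothetical names, the convention of mapping digits to `9` and letters to `A` is an assumption that extends the SSN example above, and minimum and maximum are compared as plain strings.

```python
import re
from collections import Counter


def value_pattern(value):
    """Map every digit to 9 and every ASCII letter to A, keeping punctuation."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_column(values):
    """Return basic statistics plus pattern frequencies for one column."""
    # Treat None and empty strings as missing values (an assumption).
    present = [v for v in values if v not in (None, "")]
    value_counts = Counter(present)
    pattern_counts = Counter(value_pattern(v) for v in present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(value_counts),
        "min": min(present) if present else None,  # string comparison
        "max": max(present) if present else None,  # string comparison
        "mode": value_counts.most_common(1)[0][0] if present else None,
        "patterns": dict(pattern_counts),
    }


# Illustrative values only: one blank and one badly formatted SSN.
ssn_column = ["123-45-6789", "987-65-4321", "123-45-6789", "", "12345678"]
profile = profile_column(ssn_column)
for name, result in profile.items():
    print(name, result)
```

In a metadata mart, each of these results would be stored as a row keyed by file and column, so the pattern counts can later be tied to a business term.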
Underlying each of these analytical capabilities is a set of refined processes and developed, proven code for accomplishing these fundamental tasks.
In a future post I will describe how to implement these capabilities, with or without vendor products, from a logical perspective.
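As one small illustration of such a process, the fuzzy matching capability can be approximated with Python’s standard library. This is a sketch under assumptions: `SequenceMatcher` from `difflib` stands in for whatever similarity algorithm a given tool actually uses, and the function names, the sample company names and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity ratio in [0, 1] after trimming and lower-casing both values."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def similar_pairs(values, threshold=0.85):
    """Return (value_a, value_b, score) for each distinct pair at or above threshold."""
    pairs = []
    # Compare each distinct value with every other one exactly once.
    for a, b in combinations(sorted(set(values)), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Illustrative values only.
names = ["Acme Corp", "ACME Corp.", "Acme Corporation", "Globex Inc"]
matches = similar_pairs(names)
for a, b, score in matches:
    print(a, "~", b, score)
```

Comparing every pair is fine for a sketch, but it grows quadratically with the number of distinct values, so real record linkage usually narrows the candidates first.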