You get what you inspect, not what you expect!
This post will combine business needs with current capabilities for scoring data sets/tables large or small consistently. The methodology and approach we’re going to discuss was implemented with AWS Athena and QuickSight.
A few years ago we put together an article on simple data profiling in a simple sql environment.
We’ve had the occasion to use this method at every client we,ve been to in the last decade.
A very important concept in big data, data mining IT integration and analytics projects is the idea of analyzing, surveying profiling your data and knowing what you’re going to have deal with before, you have to deal with it in a one off fashion
Why Score data , first a discussion of the business value of why this matters.
Let’s assume that you have a business problem, a business question that needs answers and you need to know the details.
There is a predisposition today to call a data scientist.
The problem is you don’t need to know the science behind the data, you need to know, the information that can be derived from the data
Business is based on Concepts.
People thinks in terms of Concepts.
Concepts come from Knowledge.
Knowledge comes from Information.
Information comes from Formulas.
Formulas determine Information relationships based on Quantities.
Quantities come from Data.
Data physically exist.
Data needs to be deciphered into information in order to apply formulas to determine relationships and validate concepts.
My point is proving these low level concepts is probably worthless to the business and maybe even destructive, unless they are associated with a actual set of Business Goals or Measurements and they are linked directly with understandable Business deliverables.
And moreover prior to even considering or engaging a data scientist or attempt data mining you should process your datasets through a ‘Scoring” process.
Creating the information value chain through linking and mapping
When we say linking we mean creating an “information value chain” relating Business Goals to Business Questions and breaking them down(decomposing) them into the following:
- Business Goal – Corporate Objectives
- Business Question – Question needed for managing meeting the objectives.
- Key Performance Indicators(KPI’s) – Specific formulas required. (Average Margin by Product)
- Categories or Natural grouping of attributes relating to each other.(Customer, Name, Address etc…)
- Drill Paths – The order to the fields(attributes) necessary to analyze or drill down on the metrics.(Product=>Geography=>Organization=>Time or Geography=>Organization=>Product=>Time)
- Business Matrix – Cross reference or matrix showing relationships between Business Questions, Business Processes, KPIs, and Categories. Linking your business model, to required information and finally to data
About scoring data
There are two things necessary to Score a data set or file universally.
File Scoring: Overall combined score of all columns based on standards data profiling measures(count of nulls, minimum value, maximum value, mode value, total records, unique count, etc…)
Column Scoring: measures for each column or domain including but not limited to frequency of distinct values, patterns.
These two measure will combine to provide a “Score” and the necessary detail to analyze the results in detail and predict corrective methods and fitment for possible feature engineering for machine learning models as well as general data quality understanding.
Back to how we got here:
Our projects were standard data warehouse, integration, MDM old school projects.
Nonetheless they required a physical or virtual server, a database and usually they were limitations on the amount of data that could be profiled due to these constraints.
However the problems we faced then and that our clients faced are the same problems they face today only exponentially bigger.
Fast forward 10 years. No Database, No Server, No Limitations sizewize, and pay for only what you use.
This has only been possible since last year it was not due virtual machines, or big data.
It is due to the advent of serverless sql, as well as other serverless technologies
We created a few diagrams below to try to illustrate the concepts both from a logical and physical perspective
As you can see below most projects begin with the idea of ingesting data transforming it or summarizing it in some manner and creating reports as fast as possible.
What I find most often is that the source analysis, the matching of source data content contextualization to business requirements and expected outcome is usually last, or discovered during the build. This requires the inclusion and creation of a meta-data mart and the ability to provide scoring and analysis on the contextual of data agnostic way, put simply scoring the data or content , Purely on the data, prior to trying to use it for business intelligence, reporting, visualizations and or most importantly data mining.
As you can see below we need to insert simple processes Conceptionally the same as outlined in my original blog in order to gain information and insight and predictability in providingi business intelligence output , visualization or featuring engineering for possible data mining.
So let’s explore what we can accomplish now without a server without a database.
We have implemented my own version of the capabilities for profiling using Athena the new AWS offering as well as a few other tools.
AWS and in particular Athena will provide the capability to accomplish all these things on demand and pay as we go
We plan to follow this article up with detailed instructions on how to accomplish this however as you can see it is using all the existing standard tools for the S3, Athena and Glue,Lambda.
Well actually I spent some time in the QuickSight tool and below example is a pathetic example of a dashboard it’s just to show you real quick the idea of looking at aggregate patterns of data and details of data visually and literally after only a few minutes of effort.
Here a quick example looking at Phone1 column patterns for my customer dataset. Notice the majority of records is in the 999-999-9999 pattern for 89%.
Then we add a Field and do a quick drill into the pattern representing 14% or 9999999999.
Now we can see the actual numbers the make up the 14%
Another fast change and we are in a Bar chart.
I know these are very simple visuals however the last point I’d like to make is that this experiment took us about 4 hours and we did on an ipad Pro, including signing up and creating an AWS account.
Last word be business focused
Last word on connecting the business objectives to the data after scoring it. This may seem like a cumbersome process, however unless you link the informational and abstract concepts directly to data columns or fields you will never have a true understanding of information based on your data. This mapping is the key, the Rosetta stone for deciphering you data,