I'd like to offer some advice on transitioning the skills and knowledge you worked hard to attain to include some of the new AI and LLM developments. It's actually less disruptive, and more beneficial, than you may think.
We will get to prompt engineering a little later; for now, let's talk about the SMEs, or experts, you most likely already have on your current data and requirements.
First, notice I used the term AI scientist instead of data scientist. A data scientist today is really an AI model scientist, and they will help you apply models. Our concern here is that a lot of folks hold opinions that are heuristic and not necessarily fact-based. We are going to suggest some techniques, and provide some mentoring, for an important factor in AI: proper training. We specialize in providing techniques and mentoring for separating information that is formulated as opinion from facts, or data, which cannot change.
There is a series of steps involved in preparing your data for use in AI and chat in the form of LLM models. This is not much different from what you have done before, and you may have most of the information already gathered. In order to properly design the requirements for your model, we would collect the items listed below. It is important to realize these steps are critical if you are to have confidence in your model's output, which will be the result of integrating your Word documents, your presentations, your spreadsheets, and, of course, your actual data.
We will be talking about modeling of information (words) versus modeling and requirements for data preparation. There is a difference that is extremely important, and it is in line with what you have been doing. I know that data preparation is not glamorous, but in my 20+ years of experience you will get nowhere without it. You can't AI it, you can't impute it: you need to discuss requirements with people, write them down, and then execute. The AI will make the legwork faster, but in the end you will have to review it; otherwise you may end up needlessly retracing your steps based on improper preparation. This can be avoided by following the proper steps (a small collection sketch follows the list):
1. Word documents
2. Presentations
3. Spreadsheets
4. Data reports
5. Data quality report for AI preparation
6. Internet
7. Other sources (network, internet, or local)
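To make the preparation concrete, here is a minimal sketch of how these sources might be pulled into a single text corpus before any LLM work begins. It assumes the python-docx, python-pptx, pandas, and openpyxl packages and a hypothetical local folder named source_docs; your own collection step will differ.

# A minimal sketch: gather text from Word documents, presentations, and
# spreadsheets into one corpus for later LLM preparation.
# Assumes python-docx, python-pptx, pandas, and openpyxl are installed,
# and a hypothetical folder named "source_docs" holds the files.
from pathlib import Path

import pandas as pd
from docx import Document          # Word documents
from pptx import Presentation      # Presentations

def collect_corpus(folder="source_docs"):
    corpus = []
    for path in Path(folder).glob("*"):
        if path.suffix == ".docx":
            doc = Document(str(path))
            text = "\n".join(p.text for p in doc.paragraphs)
        elif path.suffix == ".pptx":
            pres = Presentation(str(path))
            text = "\n".join(
                shape.text_frame.text
                for slide in pres.slides
                for shape in slide.shapes
                if shape.has_text_frame
            )
        elif path.suffix in (".xlsx", ".csv"):
            df = pd.read_csv(path) if path.suffix == ".csv" else pd.read_excel(path)
            text = df.to_csv(index=False)   # keep the raw values as text
        else:
            continue
        corpus.append({"source": path.name, "text": text})
    return corpus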
We have suggested tools, techniques, and open source options for each of these. Don't let the list intimidate you; what is important with today's AI capabilities is to integrate your words, your thoughts, your abstractions, and your actual data together in order to get trustworthy results from your AI.
We will be providing a separate post on each of these and then, finally, on how they come together. Our point is that what you have been doing to understand and form requirements for traditional BI can be reutilized and extended for AI.
With a little guidance, you can actually chat with the information you have in Collibra, or any data governance tool, and integrate that with your BI data warehouse and data marts.
This is possible by leveraging new capabilities, not necessarily new vendors. No doubt new vendor features are on the horizon, but the ability to chat with your data governance information is here now. If you have already implemented data quality, MDM, or even data governance, early adoption and prototyping of AI is possible today. We can enable this very quickly and easily by leveraging your current capabilities, knowledge, and tools, combining them with current LLM and Microsoft Copilot capabilities to give you the ability to create your own LLM privately and securely. Basically, the LLM is what provides the chat capability, but this time with your own data. In addition, we have exceptional data quality capabilities, which can also be enhanced for you. This capability will take traditional BI, data governance, data quality, and MDM to new heights. Finally, we have a decade of experience; this is merely an extension of the Information Value Chain methodologies, which we can gladly help you take advantage of.
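As a rough illustration of what "chatting with your governance information" can look like, here is a minimal sketch that retrieves the most relevant business glossary entries for a question and builds a prompt for a privately hosted LLM. The glossary rows, the scikit-learn based retrieval, and the final "send to LLM" step are illustrative assumptions, not any specific vendor's API.

# A minimal retrieval sketch: find the glossary entries most relevant to a
# question and build a prompt for a privately hosted LLM. The glossary rows
# and the final send-to-LLM step are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

glossary = [
    "Operating Income: gross income minus operating expenses, depreciation and amortization",
    "Subscriber: the primary member on a health plan contract",
    "PPO Plan: a plan restricted to charges within approved counties",
]

def retrieve(question, k=2):
    # Rank glossary entries by TF-IDF cosine similarity to the question.
    vec = TfidfVectorizer().fit(glossary + [question])
    scores = cosine_similarity(vec.transform([question]), vec.transform(glossary))[0]
    return [glossary[i] for i in scores.argsort()[::-1][:k]]

question = "Why is operating income low this quarter?"
prompt = "Using this glossary:\n" + "\n".join(retrieve(question)) + "\n\nAnswer: " + question
print(prompt)   # this prompt would then be sent to your private LLM endpoint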
We in IT have complicated and diluted the concept and process of analyzing data and business metrics incredibly in the last few decades. We seem to be focusing on the word data.
“There is a subtle difference between data and information.”
There is a subtle difference between data and information. Data are the facts or details from which information is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context.
The history of temperature readings all over the world for the past 100 years is data.
If this data is organized and analyzed to find that global temperature is rising, then that is information.
The number of visitors to a website by country is an example of data.
Finding out that traffic from the U.S. is increasing while that from Australia is decreasing is meaningful information.
Often data is required to back up a claim or conclusion (information) derived or deduced from it.
For example, before a drug is approved by the FDA, the manufacturer must conduct clinical trials and present a lot of data to demonstrate that the drug is safe.
Because data needs to be interpreted and analyzed, it is quite possible — indeed, very probable — that it will be interpreted incorrectly. When this leads to erroneous conclusions, it is said that the data are misleading. Often this is the result of incomplete data or a lack of context.
For example, your investment in a mutual fund may be up by 5% and you may conclude that the fund managers are doing a great job. However, this could be misleading if the major stock market indices are up by 12%. In this case, the fund has underperformed the market significantly.
Synthesis: the combining of the constituent elements of separate material or abstract entities into a single or unified entity (as opposed to analysis, the separating of any material or abstract entity into its constituent elements).
The simple action of linking data file metadata names to a business glossary of terms will result in deeply insightful and informative business analysis.
“Analysis: the separating of any material or abstract entity into its constituent elements”
In order for a business manager to perform analysis, you need to be able to start the analysis from understandable business terminology.
And then provide the manager with the ability to decompose, or break apart, the result.
These are the three essential sets of capabilities and associated techniques for analysis and lineage.
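As a minimal sketch of that linking step, assume a pandas DataFrame of physical column metadata and another of business glossary terms; a simple join on an agreed mapping is enough to let profiling statistics be reported in business language. The table, column, and term names here are hypothetical examples.

# A minimal sketch of linking physical column metadata to business glossary
# terms so profiling results can be reported in business language.
# All table, column, and term names below are hypothetical examples.
import pandas as pd

columns_meta = pd.DataFrame({
    "source_system": ["CLAIMS", "CLAIMS", "MEMBER"],
    "table_name":    ["CLAIM_LINE", "CLAIM_LINE", "SUBSCRIBER"],
    "column_name":   ["SBSCR_SSN", "PLAN_CD", "DOB"],
})

glossary = pd.DataFrame({
    "business_term": ["Subscriber SSN", "Plan Type", "Date of Birth"],
    "column_name":   ["SBSCR_SSN", "PLAN_CD", "DOB"],
})

# The mapping itself: every physical column tied to an understandable term.
mapping = columns_meta.merge(glossary, on="column_name", how="left")
print(mapping[["business_term", "source_system", "table_name", "column_name"]])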
I have been in this business over 45 years, and I'd like to offer one example of the power of the concept of a meta-data mart and lineage as it relates to business insight.
A lineage, information and data story for BCBS
I was called on Thursday and told to attend a meeting on Friday between our company's leadership and the new Chief Analytics Officer. He was prototypical of the new IT: a "new school" IT director.
I had been introduced via LinkedIn to this director a week earlier as he had followed one of my blogs on metadata marts and lineage.
After a brief introduction, our leadership began to speak, and the director immediately held up his hand and said, "Please don't say anything right now. The profiling you provided me is at the kindergarten level, and you are dishonest."
The project was a 20 week $900,000 effort and we were in week 10.
The company had desired to do a proof of concept to better understand the use of the Informatica Data Quality (IDQ) tool, as well as the direction for a data governance program.
To date, what had been accomplished was an accumulation of hours of effort and billing that had not resulted in any tangible deliverable.
The project had focused on the implementation and functionality of the popular vendor tool and canned data profiling results, not on providing information to the business.
The director had commented on my blog post and asked if we could achieve that at his company; I of course said yes.
Immediately I proposed we use a methodology that would allow us to focus on a top-down process of understanding critical business metrics and a bottom-up process of linking data to business terms.
My basic premise was that unless your deliverable from a data quality project can provide you business insight from the top down it is of little value. In essence you’ll spend $900,000 to tell a business executive they have dirty data. At which point he will say to you “so what’s new”.
The next step was to use the business terminology glossary that existed in Informatica Metadata Manager and map those terms to source data columns and source systems, not an extremely difficult exercise. However, this is the critical step in providing a business manager the understanding and context of data statistics.
The next step was the crucial one, in which we made a slight modification to the IDQ tool to allow storing the profiling results in a meta-data mart and associating a business dimension from the business glossary with the reported statistics.
We were able to populate my predefined metadata mart dimensional model by using the tool the company had already purchased.
Lastly, by using a dimensional model, we were able to let the business apply their current reporting tool.
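For readers who have not seen a meta-data mart, here is a minimal sketch, assuming a simple star schema: profiling statistics as the fact table, keyed to a business-term dimension taken from the glossary. The column names and values are illustrative, not the actual IDQ model.

# A minimal sketch of a meta-data mart: profiling statistics stored as a
# fact table keyed to a business-term dimension, so any reporting tool can
# slice data quality results by business terminology. Names and values are
# illustrative only.
import pandas as pd

dim_business_term = pd.DataFrame({
    "term_key":      [1, 2],
    "business_term": ["Subscriber SSN", "Dependent Age"],
    "source_system": ["MEMBER", "MEMBER"],
})

fact_profile = pd.DataFrame({
    "term_key":      [1, 1, 2],
    "measure_name":  ["row_count", "invalid_format_count", "over_35_count"],
    "measure_value": [250_000, 6_000, 35_000],
    "profile_date":  pd.to_datetime(["2016-03-01"] * 3),
})

# Any BI tool can now report data quality by business term.
report = fact_profile.merge(dim_business_term, on="term_key")
print(report.pivot_table(index="business_term",
                         columns="measure_name",
                         values="measure_value"))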
Upon realizing the issues they faced in their business metrics, they accelerated the data governance program and canceled the data lake until a future date.
Within six weeks we provided an executive dashboard based on a meta-data mart that allowed the business to reassess their plans involving governance and a data lake.
Here are some of the results of their ability to analyze their basic data statistics but mapped to their business terminology.
6,000 improperly formatted SSNs
35,000 dependents of subscribers over 35 years old
Thousands of charges to PPO plans out of the counties they were restricted to.
There were mysterious double counts in patient eligibility; managers were now able to drill into those counts by source system and find that a simple SyncSort utility had been used improperly, duplicating records.
I was heavily involved in business intelligence, data warehousing, and data governance until several years ago, and recently went through many chaotic personal challenges. Upon returning to professional practice, I have discovered things have not changed that much in 10 years. The methodologies and approaches are still relatively consistent; however, the tools and techniques have changed, and in my opinion not for the better. Without focusing on specific tools, I have observed that the core of data governance or MDM is enabling and providing a capability for classifying data into business categories or nomenclature, and that has really not improved.
This basic traditional approach has not changed; in essence, an AI model predicts a metric and is wholly dependent on the integrity of its features, or dimensions.
Therefore I decided to update some of the techniques and code patterns I have used in the past regarding the information value chain and record linkage, and we are going to make the results available with associated business and code examples, initially with SQL Server and Databricks plus Python.
My good friend Jordan Martz of DataMartz fame has greatly contributed to this old man's big data enlightenment, as has Craig Campbell, in updating some of the basic classification capabilities required and critical for data governance. If you would like a more detailed version of the source as well as the test data, please send me an email at iwhiteside@msn.com. Stay tuned for more updates; soon we will add neural network capability for additional automation of "governance type" classification and confidence monitoring.
Before we focus on functionality, let's focus on methodology.
Initially, understand the key metrics to be measured (KPIs), their formulas, and of course the business's expectations of their calculations.
Immediately gather file sources and complete profiling as specified in my original article found here.
Implementing the processes in my meta-data mart article will provide numerous statistics for integer or float fields; however, there are some special considerations for text fields or smart codes.
Before beginning classification you would employ similarity matching, or fuzzy matching, as described here.
As I said, I posted the code for this process on SQL Server Central 10 years ago; here is a Python version.
Roll Your Own – Python Jaro-Winkler (Databricks notebook)
Step 1a – Import pandas
import pandas
Step 2 – Import Libraries
from pyspark.sql.functions import input_file_name
from pyspark.sql.types import *
import datetime, time, re, os, pandas
ML Libraries
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, NGram, HashingTF, IDF, Word2Vec, Normalizer, Imputer, VectorAssembler
from pyspark.ml import Pipeline
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.cluster import KMeans
import numpy as np
Step 3 – Test JaroWinkler
# Count the "common" characters shared by two strings within the Jaro matching
# window, using pandas DataFrames as match-flag tables. The setup below
# (example inputs, flag tables, counters) is assumed, since the original cell
# begins mid-stream.
str1_in, str2_in = "MARTHA", "MARHTA"      # example inputs (assumed)
len_str1, len_str2 = len(str1_in), len(str2_in)
max_len = max(len_str1, len_str2)
column_names = ["FPosition", "FStatus"]    # assumed flag-table layout
common = 0
# Flag tables: one row per character position, FStatus marks a matched character.
df_temp_table1 = pandas.DataFrame([[i, 0] for i in range(1, len_str1 + 1)], columns=column_names)
df_temp_table2 = pandas.DataFrame(columns=column_names)
iCounter = 1
while iCounter <= len_str2:
    df = pandas.DataFrame([[iCounter, 0]], columns=column_names)
    df_temp_table2 = pandas.concat([df_temp_table2, df], ignore_index=True)
    iCounter = iCounter + 1
iCounter = 1
m = int(max_len / 2) - 1      # matching window half-width
i = 1
while i <= len_str1:
    a1 = str1_in[i - 1]
    # Window of str2 positions in which to look for a match to a1.
    if m >= i:
        f = 1
    else:
        f = i - m
    z = i + m
    if z > len_str2:          # cap at the end of str2 (original capped at max_len)
        z = len_str2
    while f <= z:
        a2 = str2_in[f - 1]
        if a2 == a1 and df_temp_table2.at[f - 1, "FStatus"] == 0:
            common = common + 1
            df_temp_table1.at[i - 1, "FStatus"] = 1
            df_temp_table2.at[f - 1, "FStatus"] = 1
            break
        f = f + 1
    i = i + 1
%sql
DROP TABLE IF EXISTS NameAssociative;
CREATE TABLE NameAssociative AS
SELECT a.NameLookup
      ,b.NameInput
      ,sha2(regexp_replace(a.NameLookup, '[^a-zA-Z0-9, ]', ' '), 256) AS NameLookupCleaned
      ,a.NameLookupKey
      ,sha2(regexp_replace(b.NameInput, '[^a-zA-Z0-9, ]', ' '), 256) AS NameInputCleaned
      ,b.NameInputKey
      ,JaroWinkler(a.NameLookup, b.NameInput) AS MatchScore
      ,RANK() OVER (PARTITION BY a.DetailedBUMaster
                    ORDER BY JaroWinkler(a.NameLookupCleaned, b.NameInputCleaned) DESC) AS MatchRank
FROM NameLookup AS a
CROSS JOIN NameInput AS b;
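For the %sql cell above to work, the JaroWinkler function has to be visible to Spark SQL. A minimal sketch of that registration, assuming the notebook's matching logic has been wrapped into a Python function called jaro_winkler that returns a float, might look like this:

# Register the Python Jaro-Winkler routine as a Spark SQL UDF so the %sql
# cell can call JaroWinkler(colA, colB). Assumes the notebook code above has
# been wrapped into a function jaro_winkler(s1, s2) -> float.
from pyspark.sql.types import DoubleType

def jaro_winkler(s1, s2):
    ...  # the matching logic from Step 3, returning a similarity score

spark.udf.register("JaroWinkler", jaro_winkler, DoubleType())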
The challenge for businesses is to seek answers to questions. They do this with metrics (KPIs) and by knowing the relationships of the data, organized into logical categories (dimensions), that make up the result or answer to the question. This is what constitutes the Information Value Chain.
Navigation
Let’s assume that you have a business problem, a business question that needs answers and you need to know the details of the data related to the business question.
Information Value Chain
• Business is based on Concepts.
• People think in terms of Concepts.
• Concepts come from Knowledge.
• Knowledge comes from Information.
• Information comes from Formulas.
• Formulas determine Information relationships based on quantities.
• Quantities come from Data.
• Data physically exist.
In today's fast-paced, high-tech business world, this basic navigation (drill-through) concept is fundamental, yet it seems to be overlooked in the zeal to embrace modern technology.
In our quest to embrace fresh technological capabilities, a business must realize it can only truly discover new insights when it can validate them against its business model, or its Information Value Chain, which is what currently creates its information and results.
Today data needs to be deciphered into information in order to apply formulas to determine relationships and validate concepts, in real time.
We are inundated with technical innovations and concepts; it's important to note that business is driving these changes, not necessarily technology.
Business is constantly striving for better insights, better information, and increased automation, as well as lower cost while doing these things; several of these were examined in John Thuma's latest article.
Historically these changes were few and far between; however, innovations in hardware and storage as well as in software and compute have led to a rapid unveiling of newer concepts and new technologies.
Demystifying the path forward.
In this article we're going to review the basic principles of information governance required for a business to measure its performance, as well as explore connections to some of these new technological concepts for lowering cost.
To a large degree I think we're going to find that why we do things has not changed significantly, only how: we now have different ways to do them.
It's important while embracing new technology to keep in mind that some of the basic concepts, ideas, and goals of how to properly structure and run a business have not changed, even though many more insights and much more information and data are now available.
My point is that implementing these technological advances could be worthless to the business, and maybe even destructive, unless they are associated with an actual set of business information goals (measurements, KPIs) and linked directly with understandable business deliverables.
Moreover, prior to even considering or engaging data science or attempting data mining, you should organize your datasets, capture the relationships, apply a "scoring" or "ranking" process, and be able to relate them to your business information model, or Information Value Chain, with the concept of quality applied in real time.
The foundation for a business to navigate their Information Value Chain is an underlying Information Architecture. An Information Architecture typically, involves a model or concept of information that is used and applied to activities which require explicit details of complex information systems.
Subsequently, data management and databases are required; they form the foundation of your Information Value Chain and bring this back to the business goal. Let's take a quick look at the difference between relational database technology and graph technology as part of emerging big data capabilities.
However, the timeframe of database technology evolution has introduced a cultural aspect to implementing new technology: basically, resistance to change. Businesses that are running their current operations with technology and people from the '80s and '90s have a different perception of a solution than folks from the 2000s.
Therefore, in the case of a technical solution, "perception is not reality"; awareness is. Businesses need to find ways to bridge the knowledge gap and increase awareness that simply embracing new technology will not fundamentally change why a business operates the way it does; however, it will affect how.
Relational databases were introduced in 1970, and graph database technology was introduced in the mid-2000s.
There are many topics included in the current Big Data concept to analyze, however the foundation is the Information Architecture, and the databases utilized to implement it.
There were other advancements in database technology in between as well; however, let's focus on these two.
History
1970
In a 1970s relational database, based on mathematical set theory, you could pre-define the relationships of tabular data (tables), implement them in a hardened structure, and then query them by manually joining the tables through physically named attributes, gaining much better insight than previous database technology. However, if you needed a new relationship, it would require manual effort and then migration from old to new. In addition, your answer was only as good as the hard-coded query that created it.
Early Relationship Concepts
Mid-2000s
In the mid-2000s the graph database was introduced. Based on graph theory, it defines relationships as tuples containing nodes and edges. Graphs represent things, and relationships (edges) describe connections between things, which makes them an ideal fit for navigating relationships. Unlike conventional table-oriented databases, graph databases (for example Neo4j, Neptune) represent entities and the relationships between them. New relationships can be discovered and added easily and without migration, with basically much less manual effort.
Nodes and Edges
Graphs are made up of ‘nodes’ and ‘edges’. A node represents a ‘thing’ and an edge represents a connection between two ‘things’. The ‘thing’ in question might be a tangible object, such as an instance of an article, or a concept such as a subject area. A node can have properties (e.g. title, publication date). An edge can have a type, for example to indicate what kind of relationship the edge represents.
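To make the node/edge idea concrete, here is a minimal sketch using the networkx library, chosen here purely for illustration; the databases named in the text are Neo4j and Neptune, and this is not their API.

# A small illustration of nodes with properties and typed edges, using
# networkx purely for demonstration (not the API of Neo4j or Neptune).
import networkx as nx

g = nx.DiGraph()

# Nodes are "things" and can carry properties.
g.add_node("article:42", title="Scoring Data with Athena", published="2018-05-01")
g.add_node("subject:data-quality", label="Data Quality")

# Edges connect two things and can carry a type.
g.add_edge("article:42", "subject:data-quality", type="HAS_SUBJECT")

# A new relationship can be added later without any migration.
g.add_node("metric:operating-income", label="Operating Income")
g.add_edge("subject:data-quality", "metric:operating-income", type="SUPPORTS")

for a, b, attrs in g.edges(data=True):
    print(a, "-[", attrs["type"], "]->", b)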
Current Relationship Concepts
Takeaway.
The takeaway: there are many spokes on the cultural wheel in a business today, encompassing business acumen, technology acumen, information relationships, and raw data knowledge. While they are all critical to success, the absolutely critical step is that the logical business model, defined as the Information Value Chain, is maintained and enhanced.
It is a given that all businesses desire to lower cost and gain insight from information. It is imperative that a business maintain and improve its ability to provide accurate information that can be audited and traced while navigating the Information Value Chain. Data science can only be achieved after a business fully understands its existing Information Architecture and strives to maintain it.
Note as I stated above an Information Architecture is not your Enterprise Architecture. Information architecture is the structural design of shared information environments; the art and science of organizing and labelling websites, intranets, online communities and software to support usability and findability; and an emerging community of practice focused on bringing principles of design, architecture and information science to the digital landscape. Typically, it involves a model or concept of information that is used and applied to activities which require explicit details of complex information systems.
Ancient Relationship Concepts
In essence, a business needs a Rosetta stone in order to translate past, current, and future results.
In future articles we're going to explore and dive into how these new technologies can be utilized and, more importantly, how they relate to business outcomes.
The primary takeaway from this article is that you don't start your machine learning, MDM, data quality, or analytical project with "data" analysis; you start with the end in mind, the business objective. We don't need to analyze data to know what it is; it's like oil or water or sand or flour.
Unless we have a business purpose for these things, we don't need to analyze them to know what they are, because they are only ingredients for whatever we're trying to make. What makes them important is to what degree they are part of the recipe and how they are associated.
Business Objective: Make Dessert
Business Questions: The consensus is Chocolate Cake , how do we make it?
Business Metrics: Baked Chocolate Cake
Metric Decomposition: What are the ingredients and portions?
2/3 cup butter, softened
1-2/3 cups sugar
3 large eggs
2 cups all-purpose flour
2/3 cup baking cocoa
1-1/4 teaspoons baking soda
1 teaspoon salt
1-1/3 cups milk
Confectioners’ sugar or favorite frosting
So here is the point: you don't figure out what you're going to have for dessert by analyzing the quality of the ingredients. Their quality is not important until you put them in the context of what you're making and how they relate; in essence, how the ingredients are linked or chained together.
In relation to my example of dessert and a chocolate cake: you might only have one cup of sugar, the eggs could have sat out on the counter all day, the flour could be coconut flour, and so on. You make your judgment on whether or not to make the cake by analyzing all the ingredients in the context of what you want to make, which is a chocolate cake made with possibly warm eggs, coconut flour, and only one cup of sugar.
Again, at the risk of belaboring this: you don't start your project by looking at a single entity, column, or piece of data until you know what you're going to use it for in the context of meeting your business objectives.
Applying this to the area of machine learning, data quality, and/or MDM, let's take an example:
Business Objective: Determine Operating Income
Business Questions: How much do we make, and what does it cost us?
Business Metrics: Operating income = gross income – operating expenses – depreciation – amortization.
Metric Decomposition: What do I need to determine operating income?
Gross Income = Sales Amount from Sales Table, Product, Address
Operating Expense = Cost from Expense Table, Department, Vendor
Etc…
Dimensions to Analyze for quality.
Product
Address
Department
Vendor
You may think these are the ingredients for our chocolate cake in regard to business and operating income; however, we're missing one key component: the portions, or relationships. In business, this means the association, hierarchy, or drill path that the business will follow when asking a question such as: why is our operating income low? (A small drill-down sketch follows the two example drill paths below.)
For instance the CEO might first ask what area of the country are we making the least amount of money?
After that the CEO may ask well in that part of the country, what product is making the least amount of money and who manages it, what about the parts suppliers?
Product => Address => Department => Vendor
Product => Department => Vendor => Address
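As a rough illustration of such a drill path, here is a minimal pandas sketch that walks operating income down Product, then Address (region), then Department. The DataFrame and its column values are hypothetical examples, not a real schema.

# A minimal sketch of drilling down a metric along a hierarchy
# (Product => Address => Department). The data and values are hypothetical;
# the point is that the drill path orders the analysis.
import pandas as pd

ledger = pd.DataFrame({
    "product":           ["WidgetA", "WidgetA", "WidgetB", "WidgetB"],
    "address":           ["West", "East", "West", "East"],
    "department":        ["D10", "D20", "D10", "D20"],
    "gross_income":      [500, 400, 300, 250],
    "operating_expense": [350, 200, 280, 150],
})

ledger["operating_income"] = ledger["gross_income"] - ledger["operating_expense"]

drill_path = ["product", "address", "department"]
for depth in range(1, len(drill_path) + 1):
    level = drill_path[:depth]   # drill one level deeper each pass
    print(ledger.groupby(level)["operating_income"].sum(), "\n")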
Many times these hierarchies, drill-downs, associations, or relationships are based on the various legal transactions of related data elements the company requires, either with their customers or their vendors.
The point here is that we need to know the relationships, dependencies, and associations required for each business legal transaction we're going to have to build, in order to link these elements directly to the metrics required for determining operating income and, subsequently, answering questions about it.
No matter the project, whether we are preparing to develop a machine learning model, building an MDM application, or providing an analytical application, if we cannot provide these elements and their associations to a metric, we will not have answered the key business questions and will most likely fail.
The need to resolve the relationships is what drives the need for data quality, which is really a way of understanding what you need to do to standardize your data, because the only way to create the relationships is with standards and mappings between entities.
The key is mastering and linking the relationships or associations required for answering business questions; it is certainly not just mastering "data" without context.
We need MASTER DATA RELATIONSHIP MANAGEMENT
not
MASTER DATA MANAGEMENT.
So, final thoughts: the key to making the chocolate cake is understanding the relationships and the relative importance of the data/ingredients to each other, not the individual quality of each ingredient.
This also affects the workflow. Many inexperienced MDM data architects do not realize that these associations form the basis for the fact tables in the analytical area. These associations will be the primary path (workflow) the data stewards follow when performing maintenance; the stewards will be guided by these associations to maintain the surrounding dimensions/master entities.
Unfortunately, some architects will instead focus on the technology and not the business. Virtually all MDM tools are model-driven APIs and rely on these relationships (hierarchies) to generate workflow and maintenance screens. Many inexperienced architects focus on the MVP (Minimum Viable Product), or a technical short-term deliverable, and are quickly called to task because the cost incurred by the business is not lowered, while the final product (the chocolate cake) is delayed and will now cost more.
Unless the specifics of questionable quality in a specific entity or table are understood in the context of the greater business question and association, it cannot be excluded or included.
An excellent resource for understanding this context can be found by following John Owens.
Final, final thoughts: there is an emphasis on creating the MVP (Minimum Viable Product) in projects today. My take is that in the real world you need to deliver the chocolate cake; simply delivering the cake with no frosting will not do. In reality, the client wants to "have their cake and eat it too".
Note:
Operating Income is a synonym for earnings before interest and taxes (EBIT) and is also referred to as “operating profit” or “recurring profit.” Operating income is calculated as: Operating income = gross income – operating expenses – depreciation – amortization.
This post will combine business needs with current capabilities for scoring data sets/tables large or small consistently. The methodology and approach we’re going to discuss was implemented with AWS Athena and QuickSight.
A few years ago we put together an article on simple data profiling in a simple sql environment.
We've had the occasion to use this method at every client we've been to in the last decade.
A very important concept in big data, data mining, IT integration, and analytics projects is the idea of analyzing, surveying, and profiling your data and knowing what you're going to have to deal with before you have to deal with it in a one-off fashion.
Why score data? First, a discussion of the business value of why this matters.
Let’s assume that you have a business problem, a business question that needs answers and you need to know the details.
There is a predisposition today to call a data scientist.
The problem is you don’t need to know the science behind the data, you need to know, the information that can be derived from the data
Business is based on Concepts.
People think in terms of Concepts.
Concepts come from Knowledge.
Knowledge comes from Information.
Information comes from Formulas.
Formulas determine Information relationships based on Quantities.
Quantities come from Data.
Data physically exist.
Data needs to be deciphered into information in order to apply formulas to determine relationships and validate concepts.
My point is that proving these low-level concepts is probably worthless to the business, and maybe even destructive, unless they are associated with an actual set of business goals or measurements and linked directly with understandable business deliverables.
Moreover, prior to even considering or engaging a data scientist or attempting data mining, you should process your datasets through a "scoring" process.
Creating the information value chain through linking and mapping
When we say linking we mean creating an "information value chain" relating Business Goals to Business Questions and breaking them down (decomposing them) into the following (a small sketch of one such decomposition follows the list):
Business Goal – Corporate objectives.
Business Question – Questions needed for managing and meeting the objectives.
Key Performance Indicators (KPIs) – Specific formulas required (Average Margin by Product).
Categories – Natural groupings of attributes relating to each other (Customer, Name, Address, etc.).
Drill Paths – The order of the fields (attributes) necessary to analyze or drill down on the metrics (Product => Geography => Organization => Time, or Geography => Organization => Product => Time).
Business Matrix – Cross-reference or matrix showing the relationships between Business Questions, Business Processes, KPIs, and Categories, linking your business model to required information and finally to data.
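Here is a minimal sketch of one such decomposition captured as a simple Python structure; the goal, question, KPI, and drill path shown are illustrative placeholders, and in practice this often lives in a spreadsheet or glossary tool.

# A minimal sketch of one information value chain entry, captured as plain
# data. The specific goal, question, KPI, and drill path are illustrative.
value_chain_entry = {
    "business_goal": "Grow operating income",
    "business_question": "Why is operating income low this quarter?",
    "kpi": {
        "name": "Operating Income",
        "formula": "gross_income - operating_expenses - depreciation - amortization",
    },
    "categories": ["Product", "Geography", "Organization", "Time"],
    "drill_path": ["Product", "Geography", "Organization", "Time"],
    "source_columns": {
        "gross_income": "SALES.SALES_AMOUNT",
        "operating_expenses": "EXPENSE.COST",
    },
}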
About scoring data
There are two things necessary to Score a data set or file universally.
File Scoring: Overall combined score of all columns based on standard data profiling measures (count of nulls, minimum value, maximum value, mode value, total records, unique count, etc.).
Column Scoring: Measures for each column or domain, including but not limited to frequency of distinct values and patterns.
These two measures combine to provide a "score" and the necessary detail to analyze the results, predict corrective methods, and assess fitness for possible feature engineering for machine learning models, as well as for general data quality understanding. A small profiling sketch follows.
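A minimal sketch of those column measures and a combined file score with pandas might look like the following; the completeness-based file score is an assumption for illustration, not a standard formula.

# A minimal sketch of column scoring (per-column profiling measures) and a
# combined file score. The completeness-based file score is only an
# illustrative assumption.
import pandas as pd

def column_scores(df):
    rows = []
    for col in df.columns:
        s = df[col]
        non_null = s.dropna()
        rows.append({
            "column": col,
            "total_records": len(s),
            "null_count": int(s.isna().sum()),
            "unique_count": int(non_null.nunique()),
            "min_value": non_null.min() if len(non_null) else None,
            "max_value": non_null.max() if len(non_null) else None,
            "mode_value": non_null.mode().iloc[0] if len(non_null) else None,
        })
    out = pd.DataFrame(rows)
    out["completeness"] = 1 - out["null_count"] / out["total_records"]
    return out

def file_score(scores):
    # Illustrative combination: average completeness across all columns.
    return float(scores["completeness"].mean())

sample = pd.DataFrame({"Phone1": ["303-555-0142", None, "3035550142"],
                       "State":  ["CO", "CO", None]})
profile = column_scores(sample)
print(profile)
print("file score:", file_score(profile))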
Back to how we got here:
Our projects were standard data warehouse, integration, MDM old school projects.
Nonetheless they required a physical or virtual server and a database, and usually there were limitations on the amount of data that could be profiled due to these constraints.
However, the problems we faced then, and that our clients faced, are the same problems they face today, only exponentially bigger.
Fast forward 10 years: no database, no server, no size limitations, and you pay for only what you use.
This has only been possible since last year, and it was not due to virtual machines or big data.
It is due to the advent of serverless sql, as well as other serverless technologies
We created a few diagrams below to try to illustrate the concepts both from a logical and physical perspective
The Flow
As you can see below most projects begin with the idea of ingesting data transforming it or summarizing it in some manner and creating reports as fast as possible.
Scoring
What I find most often is that the source analysis, the matching of source data content and context to business requirements and expected outcomes, is usually done last, or discovered during the build. This requires the creation of a meta-data mart and the ability to provide scoring and analysis of the content in a data-agnostic way: put simply, scoring the data or content purely on the data, prior to trying to use it for business intelligence, reporting, visualizations, or, most importantly, data mining.
As you can see below, we need to insert simple processes, conceptually the same as outlined in my original blog, in order to gain information, insight, and predictability in providing business intelligence output, visualization, or feature engineering for possible data mining.
AWS Athena
So let’s explore what we can accomplish now without a server without a database.
We have implemented our own version of these profiling capabilities using Athena, the new AWS offering, as well as a few other tools.
AWS, and in particular Athena, provides the capability to accomplish all these things on demand, paying as we go.
Athena Scoring
We plan to follow this article up with detailed instructions on how to accomplish this; however, as you can see, it uses all the existing standard tools: S3, Athena, Glue, and Lambda.
QuickSight
Well, actually, I spent some time in the QuickSight tool, and below is a pathetic example of a dashboard; it's just to show you, real quick, the idea of looking at aggregate patterns of data and details of data visually, and literally after only a few minutes of effort.
Here is a quick example looking at Phone1 column patterns for my customer dataset. Notice that the majority of records, 89%, are in the 999-999-9999 pattern.
Then we add a field and do a quick drill into the pattern representing 14%, or 9999999999.
Now we can see the actual numbers that make up the 14%.
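For readers who want to reproduce the pattern idea outside QuickSight, here is a minimal sketch of how a column can be reduced to patterns and counted; the sample values are made up.

# A minimal sketch of pattern profiling: reduce each value to a pattern
# (digits -> 9, letters -> A) and count pattern frequencies, the same idea
# behind the 999-999-9999 vs 9999999999 comparison. Sample values are made up.
import re
import pandas as pd

def to_pattern(value):
    p = re.sub(r"[0-9]", "9", str(value))
    p = re.sub(r"[A-Za-z]", "A", p)
    return p

phone1 = pd.Series(["303-555-0142", "3035550142", "303-555-0199", None])
patterns = phone1.dropna().map(to_pattern)
print(patterns.value_counts(normalize=True))   # share of each pattern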
Another fast change and we are in a Bar chart.
I know these are very simple visuals; however, the last point I'd like to make is that this experiment took us about 4 hours, and we did it on an iPad Pro, including signing up and creating an AWS account.
Last word: be business focused
A last word on connecting the business objectives to the data after scoring it. This may seem like a cumbersome process; however, unless you link the informational and abstract concepts directly to data columns or fields, you will never have a true understanding of the information based on your data. This mapping is the key, the Rosetta stone for deciphering your data.
The secret to implementing data quality is to follow the path below: a very business-driven and focused approach that is extremely iterative and collaborative.
The software or tools may change but the logical path, defined by identifying important business measurements required for successful and measurable results will not.
The key is not to think of it as some kind of technical POC or tool trial.
It is important to realize that "what" you want to measure, and thereby understand, will not change; only "how" you create the result will change.
While many organizations are led down the path of creating a data governance program, it is frankly too large a task and, more importantly, cannot be adequately planned without first implementing a data quality program with analytical capabilities.
For example, in the real world, if you wanted to drill an oil well, first and before you plan, budget, move, or buy equipment, you would survey the land, examine the minerals, and drill a test well. This is not the same as, in our IT data world, doing a vendor or tool proof of concept (POC) or a pilot to see if the vendor product works.
The oil company knows exactly how their equipment works and the processes they will follow; they are trying to determine "where" to drill, not "how" to drill.
In our world, the IT world, we act as if we need to "somehow" complete a "proof of concept" without really knowing exactly what concept we're proving.
Are we proving the tool works? Are we proving our data has errors, or that our processes are flawed? In essence, are we verifying that if we find bad data we want to fix the processes or the data? None of these concepts need "proving".
My point is that proving these low-level concepts is probably worthless to the business, and maybe even destructive, unless they are associated with an actual set of business goals or measurements and linked directly with understandable business deliverables. This is my way of saying: put this information in an organized set of spreadsheets linking business metrics, required fields, and the order in which you analyze them, and follow a proven process to document them and provide deliverables for both the business and technical needs.
When I say linking I mean creating an "information value chain" relating Business Goals to Business Questions and breaking them down (decomposing them) into the following:
Business Goal – Corporate Objectives
Business Question – Questions needed for managing and meeting the objectives.
Metric – Specific formulas required (Profit = Revenue – Expenses).
Hierarchies – The order of the fields (attributes) necessary to analyze or drill down on the metrics (Product, Department, Time).
Dimensions – Natural groupings of attributes relating to each other (Customer, Name, Address, etc.).
Business Matrix – Cross-reference or matrix showing the relationships between Business Questions, Business Processes, Metrics, and Dimensions, comparing your business model to your data model.
The methodology for building the information value chain is as follows:
Following this approach, as the diagram shows, will yield a data model and application architecture that supports answering actual business questions and provides the foundation to continue the path to data governance, or to simply hold in place and explore your data to better understand your issues and their impact, and then plan and prioritize your next steps.
Follow the path: pick a "real" goal or measurement, preferably one that matters.