Tag Archives: data lineage

Synthesizing AI & BI

Integrating Business Intelligence (BI) and Artificial Intelligence (AI) is reshaping the landscape of data analytics and business decision-making. This comprehensive analysis explores the synergy between BI and AI, how AI enhances BI capabilities and provides case examples of their integration.

BI and AI, though distinct in their core functionalities, complement each other in enhancing business analytics. BI focuses on descriptive analytics, which involves analyzing historical data to understand trends, outcomes, and business performance. AI, particularly ML, brings predictive and prescriptive analytics, focusing on future predictions and decision-making recommendations.

Artificial Intelligence (AI), primarily through Machine Learning (ML) and Natural Language Processing (NLP), significantly bolsters the capabilities of Business Intelligence (BI) systems. AI algorithms process and analyze large and diverse data sets, including unstructured data like text, images, and voice recordings. This advanced data processing capability dramatically expands the scope of BI, enabling it to derive meaningful insights from a broader array of data sources. Such an enhanced data processing capability is pivotal in today’s data-driven world, where the volume and variety of data are constantly increasing.

Real-time analytics, another critical feature AI enables in BI systems, provides businesses with immediate insights. This feature is particularly beneficial in dynamic sectors like finance and retail, where conditions fluctuate rapidly, and timely data can lead to significant competitive advantages. By integrating AI, BI tools can process and analyze data as it’s generated, allowing businesses to make informed decisions swiftly. This ability to quickly interpret and act on data can be a game-changer, particularly when speed and agility are crucial.

Morеovеr, AI еnhancеs BI with prеdictivе modеling and NLP. Prеdictivе modеls in AI utilizе historical data to forеcast futurе еvеnts, offеring forеsight prеviously unattainablе with traditional BI tools. This prеdictivе powеr transforms how businеssеs stratеgizе and plan, moving from rеactivе to proactivе dеcision-making. NLP furthеr rеvolutionizеs BI by еnabling usеrs to interact with BI tools using natural languagе. This advancement makes data analytics more accessible to those without technical expertise, broadening the applicability of BI tools across various organizational levels. Integrating NLP democratizes data and enhances user engagement with BI tools, making data-driven insights a part of everyday business processes.

Full Book coming in September

Data Mind Set interview Information Value Chain

Leave a reply

Practical artificial intelligence and it’s application to traditional business intelligence

Leave a reply

There are many books and photos that talk about the so-called revolution of artificial intelligence or AI I appreciate their enthusiasm however, I don’t think that is really being recognized that much of what they have invented i can be used in reality to greatly, reduce the time and effort to creat the kind of things we’ve had a creat for decades And they are essential to the creation of AI, which is predicated on learning and contact

One of the things that we will address is, there is a proclivity to use much language much of it is not very semantic. Today folks are describing architecture that already have descriptions. We have a mix of marketing terms and creative terms that mean the same thing, but they use words from the past. It’s causing confusion.

Just one example, maybe the term dimensionis one thing it meant to me thing 10 years ago come today it’s being used in a different context, in AI is it’s dimension or parameter or a feature There are many people of many years in a IT language is important it’s semantic

Another is the term similarity . It is used completely different in AI versus traditional, fuzzy matching. true the concept is the same but the Technical use Is different

There is no doubt of the benefit of what’s been created through the use of neural networks and transformers tthat hey can have tremendous positive impact on delivering business intelligence with the aid of artificial intelligence, machine, learning, and Deep Learning subsequently

I have been deeply involved in business intelligence, data quality, data profiling, and MDM and Data Governance for several decades.

I would like to take you on a journey and be able to help you exploit all these capabilities today and yesterday we are experience in the evolution of what we’re doing it is not a revolution. It is an evolution. if anything I hope to help achieve a basic understanding and terminalnology used in information architecture and various techniques that we have that will help , frankly nobody has a corner on the best approach it has all been done before at the logical level I want to be a part of helping us leverage, reuse, and apply what we were doing for decades, to what is now being introduced in the last several years you have to judge among the three goals, cheaper faster, better or we can guarantee is cheaper and faster. It’s up to you to make it better not necessarily the technology.

More to come . . .p

AI LLM New Project Roles including Prompt Engineering and Role AI Scientist and integrated data and information

Leave a reply

I’d like to offer some advice on transitioning skills and knowledge skills and knowledge who worked hard to retainto include some of the new AI and NLM developments it’s actually less impactful, and better than you may think

Well, I dressed up in a little bit for now let’s talk about prompt engineers You most likely have currenb SME’s or expert on your current data and requirements

First, do you notice I used the term AI scientist Instead of data scientist , A data scientist Currently is actually a AI model scientist and they will help you apply. We are concerned here with how a lot of folks have opinions heuristic no not necessarily fact-based already we are going to suggest some techniques, and provide some mentoring to explore this important factor in AI is proper training we specialize in providing techniques and mentoring in separate information which is Formulated and an opinion and facts or Data, which, cannot change

There is a series of steps involved in preparing for the use of Data inAI and Chat in the form of LLM models. This is not much different and you may have most of the information already gathered in order to properly design the requirements for your model we would collect the phone.it is important to realize the steps are critical, for you have confidence in your models putput which will be your result of integrating, your Word documents, your Presentations,, spreadsheets, and, of course your actual data.

We wiKeaton, Billy, and modeling of information words versus modeling and requirements for data preparation . There is a difference that is extremely important and in line with what you’ve been doing.I know that Data preparation is not glamorous, but in my 20+ years you’ll get nowhere without proper data preparation you can’t AI it you can’t impute it you need to discuss requirements with people and write them down and then execute it The AI will make the legwork, faster. but in the end you’ll have to review it otherwiseit otherwise you may end up needlessly retracing your steps based on Improper preparation I know that Data preparation is not glamorous, but in my 20+ years you’ll get nowhere without proper data preparation you can AI it you can’t imputed you need to discuss requirements with people and write them down and then execute a I will do it. Faster time is the legwork, but in the end you’ll have to review with Stuck you may end up needlessly retracing your steps based on.I know that Data preparation is not glamorous, but in my 20+ years you’ll get nowhere without proper data preparation you can AI it you can’t imputed you need to discuss requirements with people and write them down and then execute a I will do it. Faster time is the legwork, but in the end you’ll have to review with Stuck you may end up needlessly retracing your steps based on. improper preparation. This can be at Floydd by phone, the proper steps.

1. Word document

2. Presentations

3. Spreadsheets

4.Data reports

5. Data quality report for AI preparation

6.Internet

7.Other Sources (Network,Internet or Local)

We have suggested tools/techniques/open source.and suggestions for each of these. Don’t let that bother you, however, is important with today’s capabilities in AI integrate your words your thoughts, your abstraction, and your actual data together in order for you. They’re trustworthy results from your AI.

We will be providing a separate post on each of these and then finally how they come together our point is that the what you’ve been doing to understand and form requires for tradition BI can be reutilized and extend it for AI

Evolving Data Into Information through Lineage

Leave a reply

Information lineage through data DNA

Ira Warren Whiteside

ORIGINS
COMMON SENSE

We in IT have complicated and diluted the concept and process of analyzing data and business metrics incredibly in the last few decades. We seem to be focusing on the word data.

“There is a subtle difference between data and information.”

Information vs data

There is a subtle difference between data and information. Data are the facts or details from which information is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context.

Examples of Data and Information

The history of temperature readings all over the world for the past 100 years is data.

If this data is organized and analyzed to find that global temperature is rising, then that is information.

The number of visitors to a website by country is an example of data.

Finding out that traffic from the U.S. is increasing while that from Australia is decreasing is meaningful information.

Often data is required to back up a claim or conclusion (information) derived or deduced from it.

For example, before a drug is approved by the FDA, the manufacturer must conduct clinical trials and present a lot of data to demonstrate that the drug is safe.

“Misleading” Data

Because data needs to be interpreted and analyzed, it is quite possible — indeed, very probable — that it will be interpreted incorrectly. When this leads to erroneous conclusions, it is said that the data are misleading. Often this is the result of incomplete data or a lack of context.

For example, your investment in a mutual fund may be up by 5% and you may conclude that the fund managers are doing a great job. However, this could be misleading if the major stock market indices are up by 12%. In this case, the fund has underperformed the market significantly.

Comparison chart

Synthesis: the combining of the constituent elements of separate material or abstract entities into a single or unified entity ( opposed to analysis, ) the separating of any material or abstract entity into its constituent elements.

Synthesis

Data into Information is dominant in terms of data movement and replication, in essence data logistics.

Lineage is the key.

And with the simple action of linking data file metadata names to a businesses glossary or terms, Will result in deeply insightful and informative business insight and analysis.

“Analysis the separating of any material or abstract entity into its constituent elements”

In order for a business manager for analysis you need to be able to start the analysis at a understandable business terminology.

And then provide the manager with the ability to decompose or break apart the result.

They are three essential set of capabilities and associated techniquestechniques for analysis and lineage.

Data profiling and domain analysis as well as fuzzy matching components available on my blog https://irawarrenwhiteside.com/2014/04/13/creating-a-metadata-mart-via-tsql/
Meta-data driven creation of a meta-data mart through code generation techniques, implemented.

Underlining each of these capabilities is a set of refined, developed and proven code says for accomplishing these basic fundamental task.

One case study

I have been in this business over 45 years and I’d like to offer one example of the power of the concept of a meta-data mart and lineage as it regards to business insight.

A lineage, information and data story for BCBS

I was called on Thursday and told to attend a meeting on Friday between our companies leadership and the new Chief Analytics Officer. He was prototypical of the new IT a “new school” IT Director.

I had been introduced via LinkedIn to this director a week earlier as he had followed one of my blogs on metadata marts and lineage.

After a brief introduction, our leadership began to speak and the director immediately held up his hand he said “Please don’t say anything right now the profiling you provided me is at the kindergarten level and you are dishonest”

The project was a 20 week $900,000 effort and we were in week 10.

The company has desired to do a proof of concept and better understand the use of the informatics a tool DQ as well as direction for a data governance program.

To date what had been accomplished was in a cumulation of hours of effort in billing that has not resulted in any tangible deliverable.

The project had focused on the implementation and functionally of the popular vendor tool, canned data profiling results and not providing information to the business.

The director commented on my blog post and asked if we could achieve that at his company, I of course said yes.

Immediately I proposed we use the methodology that would allow us to focus on a tops down process of understanding critical business metrics and a bottoms up process of linking data to business terms.

My basic premise was that unless your deliverable from a data quality project can provide you business insight from the top down it is of little value. In essence you’ll spend $900,000 to tell a business executive they have dirty data. At which point he will say to you “so what’s new”.

The next step was to use the business terminology glossary that existed in informatica metadata manager and map those terms to source data columns and source systems, not an extremely difficult exercise. However this is the critical step in providing a business manager the understanding and context of data statistics.

The next step, was the crucial step in which we made a slight modification to the IDQ tool and allowed the storing of the profiling results into a meta-data mart and the association of a business dimension from the business glossary the reporting statistics.

We were able to populate my predefined metadata mart dimensional model by using the tool the company and already purchased.

Lastly by using a dimensional model we were able to allow the business to apply their current reporting tool.

Upon realizing the issues they faced in their business metrics, they accelerated the data governance program and canceled the data lake until a future date.

Now for the results.

Within six weeks we provided an executive dashboard based on a meta-data mart that allowed the business to reassess their plans involving governance and a data lake.

Here are some of the results of their ability to analyze their basic data statistics but mapped to their business terminology.

6000 in properly form SS cents
35,000 dependence of subscribers over 35 years old
Thousands of charges to PPO plans out of the counties they were restricted to.
There were mysterious double counts in patient eligibility counts, managers were now able to drill into those accounts by source system and find that a simple Syncsort utility had been used improperly and duplicated records.

Merry Christmas Data Classification, Feature Engineering , Data Governance. ‘How to’ do it and some code take a look

Leave a reply

I was heavily involved in business intelligence, data warehousing and data governance as of several years ago and recently have had many chaotic personal challenges, upon returning to professional practice I have discovered things have not changed that much in 10 yearsagovernance The methodologies and approaches are still relatively consistent however the tools and techniques have changed and In my opinion not for the better, without focusing on specific tools I’ve observed that the core to data or MDM is enabling and providing a capability for classifying data into business categories or nomenclature.. and it has really not improved.

This basic traditional approach has not changed, in essence man AI model predicst a Metric and is wholly based on the integrity of its features or Dimensions.

Therefore I decided, to update some of the techniques and code patterns, I’ve used in the past regarding the information value chain and or record linkage , and we are going to make the results available with associated business and code examples initially with SQL Server and data bricks plus python

My good friend, Jordan Martz of DataMartz fame has greatly contrinuted to this old mans BigData enlightenment as well as Craig Campbell in updating some of the basic classification capabilities required and critical for data governance. If you would like a more detailed version of the source as well as the test data, please send me an email at iwhiteside@msn.com. Stay tuned for more update and soon we will add Neural Network capability for additional automation of “Governance Type” automated classification and confidence monitoring.

Before we focus on functionality let’s focus on methodology

Initially understand key metrics to be measured/KPI‘s their formulas and of course teh businesse’s expectation of their calculations

Immediately gather file sources and complete profiling as specified in my original article found here

Implementing the processes in my meta-data mart article would provide numerous statistics regarding integers or float field however there are some special considerations for text fields or smart codes

Before beginning classification you would employ similarity matching or fuzzy matching as described here

As I said I posted the code for this process on SQL Server Central 10 years ago here is s Python Version.

databricks-logo Roll You Own – Python Jaro_Winkler(Python)

databricks-logoroll You Own – Python Jaro_Winkler(Python)
Import Notebook
Step 1a – import pandas
import pandas

Step 2 – Import Libraries

libraries from pyspark.sql.functions import input_file_name

from pyspark.sql.types import *
import datetime, time, re, os, pandas

ML Libraires

from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, NGram, HashingTF, IDF, Word2Vec, Normalizer, Imputer, VectorAssembler
from pyspark.ml import Pipeline

import mlflow

from mlflow.tracking import MLFlowClient

from sklearn.cluster import KMeans
import numpy as np
Step 3 – Test JaroWinkler

JaroWinkler(‘TRAC’,’TRACE’)
Out[5]: 0.933333
Step 4a =Implement JaroWinkler(Fuzzy Matching)

%python

def JaroWinkler(str1_in, str2_in):

if(str1_in is None or str2_in is None):
return 0

tr=0
common=0
jaro_value=0
len_str1=len(str1_in)
len_str2=len(str2_in)
column_names=[‘FID’,’FStatus’]

df_temp_table1 = pandas.DataFrame(columns=column_names)
df_temp_table2 = pandas.DataFrame(columns=column_names)

#clean_string(str1_in)
#clean_string(str2_in)

if len_str1 > len_str2:
swap_len=len_str2
len_str2=len_str1
len_str1=swap_len
swap_str=str1_in
str1_in=str2_in
str2_in=swap_str

max_len=len_str2

iCounter=1

while(iCounter <= len_str1):
df=pandas.DataFrame([[iCounter,0]], columns=column_names)
df_temp_table1=pandas.concat([df_temp_table1,df], ignore_index=True)
iCounter=iCounter+1

iCounter=1

while (iCounter <= len_str2):
df=pandas.DataFrame([[iCounter,0]], columns=column_names)
df_temp_table2=pandas.concat([df_temp_table2,df], ignore_index=True)
iCounter=iCounter+1

iCounter=1
m=round((max_len/2)-1,0)
i=1

while(i <= len_str1):
a1=str1_in[i-1]

if m >= i:
  f=1
  z=i+m 
else:
  f=i-m
  z=i+m 
if z > max_len:
  z=max_len
while (f <= z):
  a2=str2_in[int(f-1)]
  if(a2==a1 and df_temp_table2.loc[f-1].at['FStatus']==0):
    common = common + 1
    df_temp_table1.at[i-1,'FStatus']=1
    df_temp_table2.at[f-1,'FStatus']=1 
    break
  f=f+1
i=i+1

i=1
z=1
while(i <= len_str1): v1Status=df_temp_table1.loc[i-1].at[‘FStatus’] if(v1Status==1): while(z <= len_str2): v2Status=df_temp_table2.loc[z-1].at[‘FStatus’] if(v2Status==1): a1=str1_in[i-1] a2=str2_in[z-1] z=z+1 if(a1 != a2 ): tr = tr+0.5 break break i=i+1 wcd = 1.0/3.0 wrd = 1.0/3.0 wtr = 1.0/3.0 if (common != 0): jaro_value = (wcd * common)/ len_str1 + (wrd * common) / len_str2 + (wtr * (common – tr)) / common return round(jaro_value,6) Step 4b – Register JaroWinkler spark.udf.register(“JaroWinkler”, JaroWinkler) Out[6]:
Step8a – Bridge vs Master vs AssociativeALL

%sql
DROP TABLE IF EXISTS NameAssociative;
CREATE TABLE NameAssociative;

SELECT
Name
,NameInput

,sha2(replace ( NameLookput,’%[^a-Z0-9, ]%’,’ ‘) , 256) as NameLookupCeaned ,a.NameLookupKey
,sha2(replace( NameInput,’%[^a-Z0-9, ]%’,’ ‘) , 256) as NameInput,b.NameInputKey
,JaroWinkler(a.NameLookup, b.NameInput) MatchScor
,RANK() OVER (Partition by a.DetailedBUMaster ORDER BY JaroWinkler(a.NameLookupCleande, b.NameInputCleaned) DESC) NameLookup,b.NameLookupKey
FROM NameInput as a
CROSS JOIN NameLookup as b

Metadata Mart the Road to Data Governance

2 Replies

or Guerilla Data Governance

Implementing a Metadata Mart the Road to Data Governance best viewed in PRESENTATION Mode, there is animation.

This presentation has narrative, play in presentation mode with sound on.

https://onedrive.live.com/redir?resid=E7059817CA5789D0!10028&authkey=!AE6eWl-gzh2mq_M&ithint=file%2cpptx

Self Service Semantic BI (Business Intelligence) Concept

Leave a reply

We want to build an Enterprise Analytical capability by integrating the concepts for building a Metadata Mart with the facilities for the Semantic Web

Metadata Mart Source
- (Metadata Mart as is) Source Profiling(Column, Domain & Relationship)
+
- (Metadata Mart Plus Vocabulary(Metadata Vocabulary)) Stored as Triples(subject-predicate-object) (SSIS Text Mining)
+
- (Metadata Mart Plus)Create Metadata Vocabulary following RDFa applied to Metadata Mart Triple(SSIS Text Mining+ Fuzzy (SPARGL maybe))
+
Bridge to RDFa – JSON-LD via Schema.org
- Master data Vocabulary with lineage (Metadata Vocabulary + Master Vocabulary) mapped to MetaContent Statements)) based on person.schema.org
- Creates link to legacy data in data warehouse
- +RDFa applied to web pages
- +JSON-LD applied to
- + any Triples from any source

Semantic Self Service BI
- Metadata Mart Source + Bridge to RDFa

I have spent some time in this for quite a while now and I believe there is a quite a bit of merit in approaching the collection of domain data and column profile data, in regards to the meta-data mart, and organize them in a triple’s fashion

The basis for JSON-LD and RDFa is the collection of data as a triple. Delving into said deeper

I believe with the proper mapping for the object reference and deriving of the appropriate predicates in the collection of the value we could gain some of the same benefits as well as bringing the web data being “collected, there by linking to source data.

Consider the following excerpt regarding Vocabularies derived from MetaContent via Metadata Structure

“Metadata structures[edit]

Metadata (metacontent), or more correctly, the vocabularies used to assemble metadata (metacontent) statements, are typically structured according to a standardized concept using a well-defined metadata scheme, including: metadata standards and metadata models. Tools such as controlled vocabularies, taxonomies, thesauri, data dictionaries, and metadata registries can be used to apply further standardization to the metadata. Structural metadata commonality is also of paramount importance in data model development and in database design.

Metadata syntax[edit]

Metadata (metacontent) syntax refers to the rules created to structure the fields or elements of metadata (metacontent).^[11] A single metadata scheme may be expressed in a number of different markup or programming languages, each of which requires a different syntax. For example, Dublin Core may be expressed in plain text, HTML, XML, and RDF.^[12]

A common example of (guide) metacontent is the bibliographic classification, the subject, the Dewey Decimal class number. There is always an implied statement in any “classification” of some object. To classify an object as, for example, Dewey class number 514 (Topology) (i.e. books having the number 514 on their spine) the implied statement is: “<book><subject heading><514>. This is a subject-predicate-object triple, or more importantly, a class-attribute-value triple. The first two elements of the triple (class, attribute) are pieces of some structural metadata having a defined semantic. The third element is a value, preferably from some controlled vocabulary, some reference (master) data. The combination of the metadata and master data elements results in a statement which is a metacontent statement i.e. “metacontent = metadata + master data”. All these elements can be thought of as “vocabulary”. Both metadata and master data are vocabularies which can be assembled into metacontent statements. “

The MetadataMart serve as the source for both metadata vocabulary and MDM for the Master Data Vocabulary.

For the Master Data Vocabulary consider schema.org which defines most of the schemas we need. Consider the following schema.org Persons Properties of Objects and Predicates:

Thing > Person

A person (alive, dead, undead, or fictional).

Property	Expected Type	Description
Properties from Person
additionalName	Text	An additional name for a Person, can be used for a middle name.
address	PostalAddress	Physical address of the item.

The key is to link source data in the Enterprise via a Business Vocabulary from MDM to the Source Data Metadata Vocabulary from a Metadata Mart to conform the triples collected internally and externally.

In essence information from the web applications can be integrated with the dimensional metadata mart, MDM Model and existing Data Warehouses providing lineage for selected raw data from web to Enterprise conformed Dimensions that have gone thru Data Quality processes.

Please let me know your thoughts.

Creating a Metadata Mart via TSQL – Complete Data Profiling Kit – Download

15 Replies

With Data Profiling can apply the age old management adage “You get what you inspect, not what you expect” Readers Digest.

This article will describe how to implement a data profiling dashboard in Excel and a metadata repository as well as the TSQL required to load and update the repository.

Future articles will explore the data model as well as column and table relationship analysis using the Domain profiling results.

Data Profiling is essential to properly determine inconsistencies as well data transformation requirements for integration efforts.

It is also important to be able to communicate the general data quality for the datasets or tables you will be processing.

With the assistance over the years of a few friends(Joe Novella, Scott Morgan and Michael Capes) as well as the work of Stephan DeBlois, I have created a set of TSQL Scripts that will create a set of tables that will provide the statistics to present your clients with a Data Quality Scorecard comparable to existing vendor tools such as Informatica , Data Flux , Data Stage and the SSIS Data Profiling Task.

This article contains a Complete Data Profiling Kit – Download (Code, Excel Dashboards and Samples) providing capabilities similar to leading vendor tools, such as Informatica, Data Stage, Dataflux etc…

The primary difference is the repository is open and the code is also open and available and customizable. The profiling process has 4 steps as follows:

Create Table Statistics – Total count of records.
Create Column Statistics – Specific set of statistics for each column(i.e… minimum value, maximum value , distinct count, mode pattern, blank count, null count etc…)
Create Column Domain Statistics – domain count(count of unique vales),domain pattern(SSN=999-99-9999,ZIP = 99999-9999)

Here is a sample of the Column Statistics Dashboard: There are three panels show one worksheet for a sample “Customers” file.

Complete Column Dashboard:

DatabaseName – Source Data Base Name
SchemaName – Source Schema Name
TableName – Source Table Name
ColumnName – Source Column Name within this Table
RecordCount – Table record count
DistinctDomainCount – The number of distinct values within entire table.
UniqueDomainRatio – The ration of unique records to total records.
MostPopularDomain – The most frequently occurring value for this column.
MostPopularDomainCount – The total for the most popular domain.
MostPopularDomainRatio – The ration for the most popular value to total number of values.
MinDomain – The lowest value with in this column.
First indicator for valid values violations.
MaxDomain – The highest value with in this column.
First indicator for valid values violations.
MinDomainLength – Length in characters for the minimum domain.
MaxDomainLength – Length in characters for the maximum domain.
NullDomainCount – The number of nulls within this column.
NullDomainRatio – The ration of nulls to total records.
BlankDomainCount – The number of blanks within this column.
BlankDomainRatio – The ration of blanks to total records.
DistinctPatternCount – The number of Distant values within entire table.
MostPopularPattern – The most frequently occurring pattern for this column.
MostPopularPatternCount – The total for the most popular domain.
MostPopularPatternRatio – The ration for the most popular pattern to total number of patterns.
InferredDataType – The data inferred from the values.

Complete Dashboard:

Column Profiling Dashboard 1-3:

Column Profiling Dashboard 2-3:

Column Profiling Dashboard 3-3:

Domain Analysis:

In the example below you see an Excel Worksheet that contains a pivot table allowing you to examine a columns patterns , in this cse Zip code, and subsequently drill into the actual values related to one of the patterns. Notice the Zip code example, we will review the pattern “9999”, or Zip code with only 4 numeric digits. When you click on the pattern og “9999” the actual value is revealed is

Domain Analysis for ZipCode

Domain Analysis for Phone1

Running the Profiling Scripts Manually

Perquisites:

The scripts support two databases. One is the MetadataMart for storing the profiling results, the other is the source for your profiling.

There are four scripts , simple run them in the following order:

0_Create Profilng Objects – Create all the Data Profiling Tables and Views
1_Load_TableStat – This script will load records into the TableStat profiling table
2_Load ColumnStat – This script will load records into the ColumnStat profiling table. Specify Database, Schema and Table name filters as needed. Example, to profile every table names starting with “Dim” then change the table filter to SET @TABLE_FILTER = ‘Dim%. Specify the Database where the Data Profiling tables reside DECLARE @MetadataDB VARCHAR(256) SET @MetadataDB = ‘ODS_METADATA’ Specify Database, Schema and Table name filters as needed. Example, to profile every table names starting with “Dim” then change the table filter to SET @TABLE_FILTER = ‘Dim%
SET @DATABASE_FILTER = ‘CustomerDB’
SET @SCHEMA_FILTER = ‘dbo’
SET @TABLE_FILTER = ‘%Customer%’
SET @COLUMN_FILTER = ‘%’
3_load DomainStat – This script will load records into the DomainStat profiling table. Specify the Database where the Data Profiling tables reside DECLARE @MetadataDB VARCHAR(256) SET @MetadataDB = ‘ODS_METADATA’ Specify Database, Schema and Table name filters as needed. Example, to profile every table names starting with “Dim” then change the table filter to SET @TABLE_FILTER = ‘Dim%
SET @DATABASE_FILTER = ‘CustomerDB’
SET @SCHEMA_FILTER = ‘dbo’
SET @TABLE_FILTER = ‘%Customer%’
SET @COLUMN_FILTER = ‘%’
-1_DataProfiling – Restart – Deletes and recreates all profiling tables

Please feel free to contact me with any questions?

Microsoft BI Stack Leveraged with Vendor Add – In’s for Data Quality and Data Lineage.

Leave a reply

Low cost strategic Business Analytics and Data Quality

This paper will outline the advantages of developing your complete Business Analytics solution within the Microsoft suite of Business Intelligence capabilities, including Microsoft SQL Server(SSIS, SSAS, SSRS and Microsoft Office).

This approach will yield and very powerful and detailed Analytical application focused on your business and it particular needs both data sources and business rules. In addition to a reporting and analytical solution, you will have insight and access to the data and business rule “Lineage” or DNA allowing you click on any anomalies in any dash board and not only drill into the detail, but also see how data was transformed and/or changes as it was process to create your dashboard, excel workbook or report.

Lineage as it applies to data in Business Intelligence provides the capability see track back where data came from. Essentially it supports “Metric Decomposition” which is a process for breaking down a metric (business calculation) into separate parts. With this capability you can determine your data’s ancestry or Data DNA.

Why does this matter? The key to Lower Cost of Ownership, that’s why.

An “Absolute Truth” in today’s environment (Business Discovery, Big Data, Self Service Whatever?) is that you never have enough detail and usually have to stop clicking before you have an answer to your question and look at another report of application. Just when you about to get to the “bottom of it” you run out of clicks As Donald Farmer (VP Product Management) pointed out recently we are hunters and that’s why we like search engines. We enter phrases for things were hunting, and the search engine presents large list which we then follow the “tracks” of what we ae looking for , if we go down a wrong path we back track and continue, continually getting closer to our answer. Search engines implement lineage thru “key words “ and there relationship to web pages, which intern link to other pages. This capability is now “heuristic” or common sense.

While the method of analyzing data is “Rule of Thumb” for internet searches, it is not in terms of analyzing your business reports.

If you were able to “link” all your data from it the original source to the final destination, then would you have the ability to “hunt” through your data the way you hunt with Google, Bing etc…

Lineage Example

ABI Data Lineage

Let apply this concept to customer records with phone numbers (phone1 column). First we clean the phone number (phone1Cleansed column). We keep both columns in our processing results. Then in addition we keep the name of the “Business Rule we used to clean it (phone1_Category), an indicator to identify valid or invalid phone numbers(phone1_Valid) and the actual column name used as a source(phone1_ColumnName). The “phone1_Rule” column is used to tell us which rule was used, the first row used only the Phone Parse rule to format the number, the second row used two rules, NumbersOnly tells us thet the number contains non numeric character and the Phone Parse rule formats the number with periods separating the area code, prefix and suffix.

ABI SSIS validation example with lineage

The end result is when this is included in your reporting or analytical solution you know:

Who: The process that applied the change, if you also log package name(Optional)
What: The final corrected or standardized value(phone1Cleansed)
Where: Which column originated the value.(phone1_ColumnName)
When: The date and time of the change.(Optional)
Why: Which rules were invoked to cause the transformation.(phone1_Category, phone1_Rule and phone1_Valid)

In addition if you then choose to implement data mining and/or “Predictive Analytics” already a part of the Microsoft BI Suite, you will be facing the typical pitfalls associated with trying to predict based on either “dirty data” or highly “scrubbed data”, lacking lineage.

I have assembled a collection of vendors product that when used in conjunction with the Microsoft BI suite can provide this capability at reasonable in line with your investment with Microsoft SQL Server.

If done manually this would require extensive additional coding, however with the tools we have selected, this can be accomplished automatically as well as automating the loading of an Analysis Services cube for Analytics.

Our solution brings together the Microsoft offerings of Actuality Business Intelligence, Melissa Data and HaloBI. This solution can be implemented by any moderately skilled Microsoft developer familiar with the BI Suite(SSIS and SSAS) and does not require any expensive niche ETL or Analytical software.

In the next post I will walk through a real world implementation.

Microsoft SSIS Package

Data Lineage SSIS Example using ABI Profiling, Melissa Data , SSIS Fuzzy Groupng, SSAS cube genertion

Actuality Business Intelligence SSIS Profiling

ABI SSIS Data Profiling

Melissa Data Contact Verify

libraries from pyspark.sql.functions import input_file_name

ML Libraires

import mlflow

from mlflow.tracking import MLFlowClient

Create Table Statistics – Total count of records.

Create Column Statistics – Specific set of statistics for each column(i.e… minimum value, maximum value , distinct count, mode pattern, blank count, null count etc…)

Create Column Domain Statistics – domain count(count of unique vales),domain pattern(SSN=999-99-9999,ZIP = 99999-9999)

Complete Dashboard:

Column Profiling Dashboard 1-3:

Column Profiling Dashboard 2-3:

Column Profiling Dashboard 3-3:

Domain Analysis:

Domain Analysis for ZipCode

Domain Analysis for Phone1

Running the Profiling Scripts Manually

Perquisites:

The scripts support two databases. One is the MetadataMart for storing the profiling results, the other is the source for your profiling.

There are four scripts , simple run them in the following order:

0_Create Profilng Objects – Create all the Data Profiling Tables and Views

1_Load_TableStat – This script will load records into the TableStat profiling table

SET @DATABASE_FILTER = ‘CustomerDB’

SET @SCHEMA_FILTER = ‘dbo’

SET @TABLE_FILTER = ‘%Customer%’

SET @COLUMN_FILTER = ‘%’

SET @DATABASE_FILTER = ‘CustomerDB’

SET @SCHEMA_FILTER = ‘dbo’

SET @TABLE_FILTER = ‘%Customer%’

SET @COLUMN_FILTER = ‘%’

-1_DataProfiling – Restart – Deletes and recreates all profiling tables

Please feel free to contact me with any questions?

Low cost strategic Business Analytics and Data Quality

Why does this matter? The key to Lower Cost of Ownership, that’s why.

Lineage Example

Actuality Business Intelligence SSIS Profiling

Melissa Data Contact Verify