Category Archives: data quality

The Architecture of Resilience: Navigating Metabolic and Digital Transformation

The

Executive Summary

This briefing document synthesizes key insights regarding the critical parallels between biological systems and digital architectures. A central theme emerges: the pursuit of rapid optimization—whether in the human body through weight loss or in enterprise systems through Artificial Intelligence—often leads to accidental sabotage when structural integrity and governance are neglected.

Biological health is predicated on “insulation and governance,” specifically the myelin sheath and metabolic regulation. Similarly, technical performance relies on data governance and structural robusticity. Key findings indicate that rapid physical transformations, such as significant weight loss or high-dose supplementation, can trigger “Slimmer’s Paralysis” and “The Zinc Paradox,” leading to profound neurological dysfunction. In the digital realm, the shift toward “Agentic AI” necessitates a move from mechanical “syntax-based” coding to strategic “reasoning-based” orchestration. The document concludes that true potential is found not in the speed of transformation, but in the integrity of the “wires”—both neurological and digital—that carry the signal.

——————————————————————————–

1. The Biological Paradox: The Risks of Rapid Physical Transformation

Biological systems rely on protective layers and nutritional synergy. When these are compromised during rapid health interventions, the results are often counter-intuitive and detrimental.

1.1 “Slimmer’s Paralysis” and Mechanical Vulnerability

Rapid weight loss can lead to peroneal neuropathy, colloquially known as “Slimmer’s Paralysis.”

Mechanism: The peroneal nerve is located superficially at the fibular head (outer knee). Adipose tissue (fat) provides a protective cushion for this nerve.
Trigger: Excessive weight loss (e.g., following bariatric surgery or extreme dieting) removes this padding, leaving the nerve vulnerable to compression.
Symptoms: Bilateral foot drop (inability to lift the front of the foot), steppage gait, and paresthesia (pins and needles) in the lateral calf or foot.
Case Study: A patient whose BMI dropped from 37.2 to 21.69 in six months experienced significant nerve damage due to a 38% reduction in body weight.

1.2 The Metabolic Relay Race

Nerve health is dependent on a synergistic chain of B vitamins that convert food into fuel.

Thiamine (B1): Essential for the Krebs cycle and nerve membrane integrity.
Riboflavin (B2): Manages the electron transport chain.
Niacin (B3): Facilitates glycolysis and DNA repair.
System Failure: If one “runner” in this relay is missing, energy production for the neuron stops, resulting in systemic breakdown.

1.3 The Zinc Paradox and Copper Deficiency

The modern obsession with zinc for immune health has created a secondary neurological crisis.

Competitive Absorption: Excessive zinc blocks copper absorption pathways in the gut.
Neurological Impact: Copper is the “architect” of myelin. Deficiency can cause spinal cord insulation to drop by up to 56%, manifesting as an “ALS-like phenotype” (muscle wasting, speech disturbances, and unsteadiness).
Cellular Energy: Copper is required for ATP production. Over 80% of individuals with low thyroid hormone feel cold; this is often a cellular energy failure where “batteries” cannot charge due to copper deficiency.

——————————————————————————–

2. Sarcopenia: The Progressive Loss of Function

Sarcopenia is defined as the age-related progressive loss of muscle mass, strength, and function. It is now recognized as a specific disease with its own ICD-10-CM code.

2.1 Diagnosis and Physical Performance

Healthcare providers utilize the SARC-F questionnaire (Strength, Assistance with walking, Rising from a chair, Climbing stairs, Falls) for initial screening.

Muscle Strength Tests: Handgrip tests, chair stand tests (measuring quads), and “Timed-up and go” (TUG) tests are standard for assessment.
Sarcopenic Obesity: The combination of low muscle mass and a high BMI raises complication risks significantly.

2.2 Sarcopenia in Speech and Swallowing

Sarcopenia affects muscles critical to speech and swallowing (dysphagia).

Impact: Older adults may experience reduced endurance for verbal communication and increased aspiration risk.
Intervention: Speech-Language Pathologists (SLPs) use clinical and instrumental assessments to develop strengthening exercises and safe swallowing strategies.

——————————————————————————–

3. The Digital Vibe Shift: AI and Data Governance

In the technical world, the evolution toward AI is described as a “Vibe Shift,” moving from hand-writing logic to orchestrating intent.

3.1 AI Governance as “Data Governance in a Helmet”

AI projects often fail due to “data chaos” rather than model limitations.

Failure Rate: Gartner predicts 60% of AI projects will fail by 2026 due to a lack of AI-ready data.
Governance Integration: AI Governance is foundational Data Governance with added “Adversarial Robustness.” It utilizes frameworks like the NIST AI Risk Management Framework (RMF)to “Map, Measure, and Manage” risk.
Semantic Trust: Validation is shifting from syntax (checking if a field is a string) to reasoning (recognizing that a birth year of 2025 for a current executive is a logical impossibility).

3.2 The Rise of the System Orchestrator

The “Syntax Memorizer” (the developer focused on library arguments) is becoming obsolete, replaced by the System Orchestrator.

Prompt Engineering: Prompts are now treated as structured code. Success requires “Context Engineering”—managing metadata, API definitions, and token budgets.
Subject Matter Expertise (SME): SMEs are more valuable than generalist programmers. A professional who understands niche nuances (e.g., horseback riding) can guide AI to produce higher-quality, accurate content.

3.3 The Zero-Refactor Revolution

Legacy systems (COBOL, IMS) are no longer viewed solely as technical debt but as “untapped IQ.”

Metadata Mechanic: Services can now extract the “DNA” of mainframes (PSBs and DBDs) to create a “context map” for AI without manual refactoring.
Conversational IQ: Organizations can integrate 60 years of historical archives into an intelligence hub (like NotebookLM), allowing users to “talk” to legacy data.

——————————————————————————–

4. Metabolic States: The Ketogenic Tightrope

The ketogenic diet (KD) is a potent tool for “nutritional ketosis” but presents a metabolic paradox.

4.1 Biological Armor

KD inhibits the NLRP3 inflammasome and regulates Drp1-mediated mitochondrial fission. This pre-conditions the brain to survive ischemic crises (strokes) by keeping cellular “power plants” intact.

4.2 Long-Term Risks

While neuroprotective in the short term, long-term KD use has shown risks in animal models:

Metabolic Complications: Potential for fatty liver disease and impaired blood sugar regulation.
Gender Divide: Male subjects in studies developed severe liver dysfunction, while females appeared largely protected.
Recrudescence: A “metabolic echo” where old stroke symptoms temporarily reappear due to physiological stressors like dehydration or infection.

——————————————————————————–

5. Summary of Critical Data Points

From Querying Rows to Querying Reason: 5 Surprising Ways AI is Redefining the Database Professional

Leave a reply

Introduction: The Maintenance Trap

For the modern database professional, the “maintenance trap” is a pervasive reality that stifles career growth and business impact. When your day is consumed by patching, manual tuning, and reactive troubleshooting, you aren’t architecting the future—you’re just keeping the lights on. The numbers confirm this stagnation: 72% of IT budgets are currently swallowed by generic maintenance rather than innovation.

However, we have reached a tipping point where the value scale is tilting. AI is not a replacement for the database expert; it is the long-awaited engine of liberation. Through the convergence of Retrieval Augmented Generation (RAG) and Autonomous systems, the traditional DBA is being reimagined as a hybrid strategist. This shift allows you to stop querying rows and start querying reason, moving from a technician of records to an architect of intelligence.

You’re Already 80% of a Data Scientist (Without Realizing It)

There is a persistent myth that database professionals must start from zero to enter the world of machine learning. The reality is far more empowering: you have already mastered the most difficult phase of the discipline. Industry data reveals that most data scientists spend 80% of their time finding, cleaning, and reorganizing data—a process known as Data Wrangling.

As a database expert, you are already an elite “wrangler.” The strategic pivot now is shifting these intensive tasks to the database itself. By transforming the database into a hybrid data management + machine learning platform, the professional evolves into a high-value AI Engineer or Data Engineer. You are the ideal candidate for these roles because you understand the underlying data structures better than anyone else.

“Most data scientists spend 80 percent of their time on tasks other than analysis, which is a massive inefficiency. Shifting these tasks to the database provides freedom from drudgery and allows the professional to focus on high-impact strategy.”

The “Self-Driving” Database is the Ultimate Career Insurance

The rise of the Autonomous Database is the ultimate insurance policy for your career. By automating the mechanical aspects of data management, these systems utilize three critical pillars:

Self-Driving: Automatically handles provisioning, monitoring, and tuning.
Self-Securing: Provides active protection against external attacks and malicious internal actors.
Self-Repairing: Maximizes uptime by protecting against planned and unplanned maintenance.

The business imperative is undeniable. Database downtime costs an average of $7,900 per minute, and 91% of organizations experience unplanned data center outages. Furthermore, 85% of security breaches occur after a CVE has already been published. By offloading these high-stakes, repetitive tasks to an autonomous system, you reclaim the bandwidth to focus on Architecture, planning, and data modeling. You aren’t losing your job; you are losing the tasks that make your job tedious.

SQL to JSON: The Secret Bridge to Large Language Models

As organizations race to implement Retrieval Augmented Generation (RAG), the database professional becomes the critical link in the AI supply chain. RAG enables Large Language Models (LLMs) to reason over private, enterprise data, but this requires a specialized technical bridge.

The surprising key to this architecture is the conversion of structured SQL results into JSON format. Because LLMs require context in a semi-structured format, the database professional now acts as the guardian of schema context. You are responsible for retrieving specific data and packaging it as a private, structured context that prevents the “hallucinations” common in generic AI. These Augmented Prompts—which combine precise user instructions with retrieved database context—are rapidly becoming the “stored procedures” of the AI era.

Move the Algorithms, Not the Data

The traditional “Data Lake” approach of moving massive datasets to external analytical tools is increasingly obsolete. Our new mantra is: “Move the Algorithms, Not the Data!” By utilizing In-database machine learning (OML), you can execute complex models directly where the data lives.

This shift enables unprecedented scale. For instance, using SPARC M8-2 hardware and the Airline On-Time dataset, systems have demonstrated the ability to process 640 million rows in-memory. Modern database professionals can now perform Feature Engineering—creating derived attributes that reflect domain knowledge—and execute models for Clustering, Anomaly Detection, Time Series Forecasting, and Regression using simple SQL syntax. This eliminates the security risks of data movement and brings Analytical Maturity to the core of the data center.

The Six-Week Transformation Roadmap

The transition from a Database Developer to a Data Scientist is a structured evolution, not a leap into the unknown. This six-week roadmap aligns your existing skills with the Analytical Maturity model:

Week 1: Business Understanding – Identify the core organizational problem.
Week 2: Data Understanding – Explore and profile available data assets.
Week 3: Data Preparation – Leverage your Data Wrangling expertise as the primary driver of project success.
Week 4: Modeling – Apply in-database ML algorithms.
Week 5: Evaluation – Rigorously test the accuracy of insights.
Week 6: Deployment – Move from Diagnostic Analysis (“What happened?”) to ML-Enabled Applications (“What will happen?”).

By following this path, you move beyond simple reporting and begin building Automated ML Applications that provide predictive value to the business.

Conclusion: The Choice to Innovate

We are entering the age of the “Thinking Database.”The industry is moving toward a future where the heavy lifting of maintenance is handled by the system itself, while the innovation is handled by you. Tools like OML Notebooks and Apache Zeppelin are now standard, accessible through the languages you already speak: SQL, Python, and R.

The choice for the database professional is clear. As the “Self-Driving” era takes hold, your value will no longer be measured by how well you maintain the engine, but by where you choose to drive the vehicle. When the database starts managing itself, will you use your new freedom to build the next generation of intelligent applications, or will you keep looking for a better wrench?

The $350k Transition: 5 Surprising Realities of Becoming an AI Engineer

Leave a reply

The software development landscape is undergoing its most dramatic transformation since the shift from assembly to high-level languages. By 2026, projections suggest that 90% of all code will be AI-generated. This reality has sparked a wave of anxiety, but the data tells a more nuanced story of bifurcation rather than obsolescence.

While entry-level tech hiring decreased by 25% year-over-year in 2024 and employment for developers aged 22–25 declined nearly 20%, the demand for senior talent capable of managing AI systems has reached a fever pitch. We are witnessing the death of the “Syntax Memorizer”—the 2022-style developer whose primary value was handwriting functional lines. In their place emerges the System Orchestrator: an engineer who leverages AI to deliver the output once expected from a team of ten.

Underneath the hype, a new layer of engineering work has emerged. This isn’t research or model training; it is product engineering where AI is a system component. If you are a full-stack architect looking to future-proof your career, the transition to becoming an AI engineer requires a deliberate evolution of your technical stack and mindset.

1. Prompting is Now “Table Stakes” (Master Context Engineering)

Many developers remain fixated on the surface layer: perfecting prompts or chasing the latest “hacks.” While prompt engineering was the buzzy role of 2023, it has rapidly become a standard capability, much like using an IDE or keyboard shortcuts.

The professional differentiator is no longer just the prompt; it is Context Engineering. This is the rigorous discipline of managing the non-prompt elements supplied to a model—metadata, API tool definitions, and token budgeting—to ensure reliability and provenance. Your value is shifting from a “Code Writer” to an architect of the environment in which the AI operates.

As Andrew Ng points out, you cannot simply “vibe code” your way to production-grade systems:

“Without understanding how computers work, you can’t just ‘vibe code’ your way to greatness. Fundamentals are still important, and for those who additionally understand AI, job opportunities are numerous!”

2. RAG is the Single Most Critical Skill (The Undervalued Infrastructure)

If you commit to one technical skill this year, make it Retrieval-Augmented Generation (RAG). While social media is captivated by flashy autonomous agents, RAG is the “undervalued infrastructure layer” that startups and enterprises are actually paying for.

RAG is the process of providing a Large Language Model (LLM) with proprietary data at the right time to prevent hallucinations. In practice, this involves:

Converting documents into embeddings(numerical vectors).
Managing vector databases like Pinecone or Qdrant for high-dimensional storage.
Designing semantic retrieval systems that allow models to interact with live, changing data.

This is the foundation of useful AI products. For example, when a DoorDash driver asks how to handle spilled pickle juice, a RAG system retrieves the specific internal protocol for vehicle maintenance to provide an accurate, human-readable answer. Similarly, Spotify uses these patterns to find songs with semantically similar lyrics. Mastering the “boring” plumbing of data flow is what separates a hobbyist from a $350k IC.

3. Workflows Over Agents (The “Deterministic” Advantage)

The term “AI Agent” is dangerously overloaded. In a hype-driven market, non-technical CEOs often demand “autonomous agents” that run until a task is done. In reality, these uncontrolled agentic loopsoften lead to exploding token costs and non-deterministic failures.

The superior architectural pattern is the controlled workflow. As an engineer, your job is to create deterministic outcomes in a non-deterministic world. This requires:

Human-in-the-loop patterns: Designing checkpoints for critical decisions.
Orchestration: Utilizing patterns like “ReAct” or “Orchestrator” to classify and route tasks programmatically.
FinOps Mindset: Implementing observability tools like Helicone or LangSmith to monitor token consumption and latency.

Having a technical opinion on workflows vs. agents is a superpower. Most companies are operating on “social media vibes”; the AI engineer provides the strategic direction and cost control necessary for enterprise scale.

4. The Return of the “CS Fundamentalist”

There is a persistent myth that AI makes Computer Science degrees obsolete. The reality is that as the cost of generating code drops to zero, the cost of the friction created by bad code—security flaws, technical debt, and architectural rot—skyrockets.

Andrew Ng notes that while 30% of traditional CS knowledge (like memorizing syntax) is fading, the remaining 70% is more vital than ever. You cannot verify or supervise AI-generated code if you do not understand the Critical Fundamentals:

Concurrency and Parallelism: Essential for managing asynchronous AI API calls and system throughput.
Memory and Performance Complexity: Vital for optimizing token usage and high-dimensional vector searches.
Networking Basics: Crucial for managing the distributed nature of modern AI services.

Deep technical knowledge is what builds the “design taste” required to know when to introduce an architectural principle and when to push back against a model’s suggestion.

5. Testing isn’t Dead—It Just Got a “Black Box” Problem

Traditional unit testing is insufficient for non-deterministic AI services. Because LLMs are “black boxes,” they require a new testing paradigm focused on Evals (evaluation sets).

Instead of testing for a specific string output, professional AI engineers utilize the LLM-as-a-judgepattern. By creating a “Gold Set” of ideal responses, you can use one LLM to score another’s output on a scale of 1 to 10. This allows you to:

Detect model drift or prompt regressions before they reach the user.
Safely upgrade or downgrade models (e.g., GPT-4o to a smaller, faster model) without breaking functionality.
Ensure that a minor prompt change by a teammate hasn’t compromised system logic.

Flying blind with non-deterministic services is a recipe for losing customer trust. A rigorous testing mindset is now the primary differentiator between an “AI Bro” and a professional engineer.

Conclusion: Crossing the 3-Month Gap

The transition from a standard full-stack developer to a high-earning AI Engineer is a marathon, but the initial competency gap can be bridged in roughly one to three months by following a structured roadmap:

Phase 1: Integrate & Accelerate (Month 1): Adopt AI pair programmers (Cursor, Copilot) and agentic review tools. Focus on moving from simple comments to structured context engineering.
Phase 2: Architect & Orchestrate (Months 2-3):Build a RAG-based application. Store proprietary data in a vector database and implement a controlled workflow using a framework like LangGraph or a manual “human-in-the-loop” pattern.
Phase 3: Strategize & Lead (Ongoing): Develop a quality framework using Evals and LLM-as-a-judge. Quantify your impact on team velocity and begin managing the technical debt that AI code inevitably generates.

In tech-forward hubs like San Francisco, senior individual contributors who master this orchestration are commanding salaries between $200,000 and $350,000.

The question is no longer whether AI will change your job, but how you will respond to the shift. Do you want to be the developer struggling to compete with AI-generated syntax, or the orchestrator designing the systems that command it?

Beyond the Hype: 5 Surprising Realities of the Modern LLM Frontier

Leave a reply

1. Introduction: The Unseen Mechanics of the AI Revolution

Large Language Models (LLMs) have successfully transitioned from laboratory curiosities to ubiquitous enterprise tools. To the casual observer, the progress looks like a linear march toward increasingly “smarter” chatbots. However, the technical reality is far more nuanced. Behind the curtain of viral interfaces, the most impactful breakthroughs are no longer just about increasing parameter counts or ingestion volume. As a Research Strategist, I observe that the real frontier has shifted toward “unseen mechanics”—the sophisticated methods researchers use to steer, optimize, and ground these models to transform them from unpredictable black boxes into high-precision, reliable instruments.

2. The Operational Safety Gap: Why Your Agent “Enters the Wrong Chat”

A critical challenge for enterprise deployment is “operational safety.” While global discourse often focuses on preventing generic harms (e.g., assisting in illegal acts), operational safety addresses a model’s ability to remain faithful to its intended purpose. Recent research, specifically the OffTopicEvalbenchmark, reveals a startling reality: LLMs are prone to “entering the wrong chat.”

When tasked with a professional role—such as an AI bank teller—models frequently fail to refuse out-of-domain (OOD) queries, straying into discussions about poetry or travel advice. The data shows that even top-tier models struggle; Llama-3 and Gemma collapsed to accuracy levels of 23.84% and 39.53% respectively in agentic scenarios. Even GPT-4 plateaus in the 62–73% range. Interestingly, the benchmark identifies Mistral (24B) at 79.96% and Qwen-3 (235B) at 77.77% as the current leaders in operational reliability.

To suppress these failures without the overhead of retraining, researchers are utilizing prompt-based steering. Techniques like Query Grounding (Q-ground) provide consistent gains of up to 23%, while System-Prompt Grounding (P-ground) delivered a massive 41% boost to Llama-3.3 (70B).

“To suppress these failures, we propose prompt-based steering methods: query grounding (Q-ground) and system-prompt grounding (P-ground), which substantially improve OOD refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts.”

3. Surgical Alignment: Steering the “Brain” Without Retraining

A major obstacle in fine-tuning is the “superposition” problem: LLM neurons are semantically entangled, often responding to multiple unrelated factors. This makes standard fine-tuning messy, as adjusting one behavior (like bias) often accidentally degrades linguistic fluency.

The Sparse Representation Steering (SRS)framework offers a “surgical” alternative. Using Sparse Autoencoders (SAEs), SRS projects dense activations (n) into a significantly higher-dimensional sparse feature space (m>n). This allows researchers to disentangle activations into millions of monosemantic features. To identify exactly which features to “turn up or down,” SRS utilizes bidirectional KL divergencebetween contrastive prompt distributions to quantify per-feature sensitivity.

This level of precision, often characterized by the L0 norm (the number of non-zero elements), allows developers to modulate specific attributes like truthfulness or safety at inference time with minimal side effects on overall quality.

“Due to the semantically entangled nature of LLM’s representation, where even minor interventions may inadvertently influence unrelated semantics, existing representation engineering methods still suffer from… content quality degradation.”

4. The 20% Rule: Efficiency via the “Heavy Hitter Oracle”

Deploying LLMs at scale is hindered by the KV Cache bottleneck. Because the cache scales linearly with sequence length, long conversations eventually overwhelm GPU memory. However, the Heavy Hitter Oracle (H2O) discovery has revealed a counter-intuitive efficiency: LLMs only need a fraction of their “memory” to maintain performance.

Researchers found that a small portion of tokens—Heavy Hitters (H2)—contribute the vast majority of value to attention scores. These tokens correlate with frequent co-occurrences in the text. By formulating KV Cache eviction as a dynamic submodular problem, the H2O framework retains only the most critical 20% of tokens. This results in up to a 29x improvement in throughput. This breakthrough democratizes AI, allowing massive models to run on smaller, cheaper hardware while retaining full contextual awareness.

5. The “Tool-Maker” Evolution: From Passive Solvers to Software Engineers

We are witnessing a fundamental shift from LLMs as “Tool Users” to LLMs as “Tool Makers” (LATM). Frameworks like LATM and CREATOR allow models to recognize when their inherent capabilities are insufficient—such as for complex symbolic logic—and respond by writing their own reusable Python functions.

This enables a cost-effective “division of labor.” An expensive, high-reasoning model (like GPT-4) acts as the Tool Maker, crafting a sophisticated utility function. A lightweight, cheaper model then acts as the Tool User, applying that function to thousands of requests. This allows models to solve problems they were never originally trained for by essentially creating their own specialized software on the fly.

6. The Semantic Shift: Moving Beyond the “Library Card Catalog”

Search technology is evolving from traditional Lexical Search to Semantic Search, fundamentally changing how information is retrieved.

Lexical Search acts like a literal “card catalog.” It relies on exact keyword matching. Searching for “affordable electric vehicles” might miss a document about a “Tesla Model 3” if those specific words are absent.
Semantic Search functions like a “knowledgeable librarian.” Using Dense Embeddings and Natural Language Processing (NLP), it maps queries into a vector space where similar concepts are mathematically grouped. It understands that “budget” and “affordable” are conceptually linked.

By leveraging Vector Databases (such as Milvus or Qdrant), modern systems now utilize a Hybridapproach. This combines the literal precision and speed of lexical search with the deep conceptual “brain” of semantic search, ensuring that intent is captured even when language is misaligned.

7. Conclusion: The Dawn of the “Interpretable” Era

The advancements moving through the AI frontier—from sparse steering and heavy-hitter optimization to autonomous tool-making—signal the end of the “black box” era. We are entering a phase where LLMs are becoming modular, efficient, and, most importantly, interpretable. By moving toward surgical control over internal representations, we move closer to systems we can truly understand and govern.

As we look forward, a vital question remains for the industry: Does the future of AI rely on building ever-larger models, or is the true path to intelligence found in making our control over them more modular and precise?

Evolving Data Into Information through Lineage

Leave a reply

Information lineage through data DNA

Ira Warren Whiteside

ORIGINS
COMMON SENSE

We in IT have complicated and diluted the concept and process of analyzing data and business metrics incredibly in the last few decades. We seem to be focusing on the word data.

“There is a subtle difference between data and information.”

Information vs data

There is a subtle difference between data and information. Data are the facts or details from which information is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context.

Examples of Data and Information

The history of temperature readings all over the world for the past 100 years is data.

If this data is organized and analyzed to find that global temperature is rising, then that is information.

The number of visitors to a website by country is an example of data.

Finding out that traffic from the U.S. is increasing while that from Australia is decreasing is meaningful information.

Often data is required to back up a claim or conclusion (information) derived or deduced from it.

For example, before a drug is approved by the FDA, the manufacturer must conduct clinical trials and present a lot of data to demonstrate that the drug is safe.

“Misleading” Data

Because data needs to be interpreted and analyzed, it is quite possible — indeed, very probable — that it will be interpreted incorrectly. When this leads to erroneous conclusions, it is said that the data are misleading. Often this is the result of incomplete data or a lack of context.

For example, your investment in a mutual fund may be up by 5% and you may conclude that the fund managers are doing a great job. However, this could be misleading if the major stock market indices are up by 12%. In this case, the fund has underperformed the market significantly.

Comparison chart

Synthesis: the combining of the constituent elements of separate material or abstract entities into a single or unified entity ( opposed to analysis, ) the separating of any material or abstract entity into its constituent elements.

Synthesis

Data into Information is dominant in terms of data movement and replication, in essence data logistics.

Lineage is the key.

And with the simple action of linking data file metadata names to a businesses glossary or terms, Will result in deeply insightful and informative business insight and analysis.

“Analysis the separating of any material or abstract entity into its constituent elements”

In order for a business manager for analysis you need to be able to start the analysis at a understandable business terminology.

And then provide the manager with the ability to decompose or break apart the result.

They are three essential set of capabilities and associated techniquestechniques for analysis and lineage.

Data profiling and domain analysis as well as fuzzy matching components available on my blog https://irawarrenwhiteside.com/2014/04/13/creating-a-metadata-mart-via-tsql/
Meta-data driven creation of a meta-data mart through code generation techniques, implemented.

Underlining each of these capabilities is a set of refined, developed and proven code says for accomplishing these basic fundamental task.

One case study

I have been in this business over 45 years and I’d like to offer one example of the power of the concept of a meta-data mart and lineage as it regards to business insight.

A lineage, information and data story for BCBS

I was called on Thursday and told to attend a meeting on Friday between our companies leadership and the new Chief Analytics Officer. He was prototypical of the new IT a “new school” IT Director.

I had been introduced via LinkedIn to this director a week earlier as he had followed one of my blogs on metadata marts and lineage.

After a brief introduction, our leadership began to speak and the director immediately held up his hand he said “Please don’t say anything right now the profiling you provided me is at the kindergarten level and you are dishonest”

The project was a 20 week $900,000 effort and we were in week 10.

The company has desired to do a proof of concept and better understand the use of the informatics a tool DQ as well as direction for a data governance program.

To date what had been accomplished was in a cumulation of hours of effort in billing that has not resulted in any tangible deliverable.

The project had focused on the implementation and functionally of the popular vendor tool, canned data profiling results and not providing information to the business.

The director commented on my blog post and asked if we could achieve that at his company, I of course said yes.

Immediately I proposed we use the methodology that would allow us to focus on a tops down process of understanding critical business metrics and a bottoms up process of linking data to business terms.

My basic premise was that unless your deliverable from a data quality project can provide you business insight from the top down it is of little value. In essence you’ll spend $900,000 to tell a business executive they have dirty data. At which point he will say to you “so what’s new”.

The next step was to use the business terminology glossary that existed in informatica metadata manager and map those terms to source data columns and source systems, not an extremely difficult exercise. However this is the critical step in providing a business manager the understanding and context of data statistics.

The next step, was the crucial step in which we made a slight modification to the IDQ tool and allowed the storing of the profiling results into a meta-data mart and the association of a business dimension from the business glossary the reporting statistics.

We were able to populate my predefined metadata mart dimensional model by using the tool the company and already purchased.

Lastly by using a dimensional model we were able to allow the business to apply their current reporting tool.

Upon realizing the issues they faced in their business metrics, they accelerated the data governance program and canceled the data lake until a future date.

Now for the results.

Within six weeks we provided an executive dashboard based on a meta-data mart that allowed the business to reassess their plans involving governance and a data lake.

Here are some of the results of their ability to analyze their basic data statistics but mapped to their business terminology.

6000 in properly form SS cents
35,000 dependence of subscribers over 35 years old
Thousands of charges to PPO plans out of the counties they were restricted to.
There were mysterious double counts in patient eligibility counts, managers were now able to drill into those accounts by source system and find that a simple Syncsort utility had been used improperly and duplicated records.

Data Governance Navigating the Information Value Chain Demystifying the path forward

Leave a reply

The challenge for businesses is to seek answers to questions, they do this with Metrics (KPI’s) and know the relationships of the data, organized by logical categories(dimensions) that make up the result or answer to the question. This is what constitutes the Information Value Chain

Navigation

Let’s assume that you have a business problem, a business question that needs answers and you need to know the details of the data related to the business question.

Information Value Chain

Information Value Chain

Business is based on Concepts.

People thinks in terms of Concepts.

Concepts come from Knowledge.

Knowledge comes from Information.

Information comes from Formulas.

Formulas determine Information relationships based on quantities.

Quantities come from Data.

Data physically exist.

In today’s fast-paced high-tech business world this basic navigation (drill thru) business concept is fundamental and seems to be overlooked, in the zeal to embrace modern technology

In our quest to embrace fresh technological capabilities, a business must realize you can only truly discover new insights when you can validate them against your business model or your businesses Information Value Chain, that is currently creating your information or results.

Today data needs to be deciphered into information in order to apply formulas to determine relationships and validate concepts, in real time.

We are inundated with technical innovations and concepts it’s important to note that business is driving these changes not necessarily technology

Business is constantly striving for a better insights, better information and increased automation as well as the lower cost while doing these things several of these were examined and John Thuma’s‘ latest article

Historically though these changes were few and far between however innovation in hardware storage(technology) as well as software and compute innovations have led to a rapid unveiling of newer concepts as well as new technologies

Demystifying the path forward.

In this article we’re going to review the basic principles of information governance required for a business measure their performance. As well as explore some of the connections to some of these new technological concepts for lowering cost

To a large degree I think we’re going to find that why we do things has not changed significantly it’s just how, we know have different ways to do them.

It’s important while embracing new technology to keep in mind that some of the basic concepts, ideas, goals on how to properly structure and run a business have not changed even though many more insights and much more information and data is now available.

My point is in the implementing these technological advances could be worthless to the business and maybe even destructive, unless they are associated with a actual set of Business Information Goals(Measurements KPI’s) and they are linked directly with understandable Business deliverables.

And moreover prior to even considering or engaging a data science or attempt data mining you should organize your datasets capturing the relationships and apply a “scoring” or “ranking” process and be able to relate them to your business information model or Information Value Chain, with the concept of quality applied real time.

The foundation for a business to navigate their Information Value Chain is an underlying Information Architecture. An Information Architecture typically, involves a model or concept of information that is used and applied to activities which require explicit details of complex information systems.

Subsequently a data management and databases are required, they form the foundation of your Information Value Chain, to bring this back to the Business Goal. Let’s take a quick look at the difference between relational database technology and graph technology as a part of emerging big data capabilities.

However, considering the timeframe for database technology evolution, has is introduced a cultural aspect of implementing new technology changes, basically resistance to change. Business that are running there current operations with technology and people form the 80s and 90s have a different perception of a solution then folks from the 2000s.

Therefore, in this case regarding a technical solution “perception is not reality”, awarement is. Business need to find ways to bridge the knowledge gap and increase awarement that simply embracing new technology will not fundamentally change the why a business is operates , however it will affect how.

Relational databases were introduced in 1970, and graph database technology was introduced in the mid to 2000

There are many topics included in the current Big Data concept to analyze, however the foundation is the Information Architecture, and the databases utilized to implement it.

There were some other advancements in database technology in between also however let’s focus on these two

History

1970

In a 1970s relational database, Based on mathematical Set theory, you could pre-define the relationship of tabular (tables) , implement them in a hardened structure, then query them by manually joining the tables thru physically naming attributes and gain much better insight than previous database technology however if you needed a new relationship it would require manual effort and then migration of old to new , In addition your answer it was only good as the hard coding query created

2020

In mid-2000’s the graph database was introduced , based on graph theory, that defines the relationships as tuples containing nodes and edges. Graphs represent things and relationships events describes connections between things, which makes it an ideal fit for a navigating relationship. Unlike conventional table-oriented databases, graph databases (for example Neo4J, Neptune) represent entities and relationships between them. New relationships can be discovered and added easily and without migration, basically much less manual effort.

Nodes and Edges

Graphs are made up of ‘nodes’ and ‘edges’. A node represents a ‘thing’ and an edge represents a connection between two ‘things’. The ‘thing’ in question might be a tangible object, such as an instance of an article, or a concept such as a subject area. A node can have properties (e.g. title, publication date). An edge can have a type, for example to indicate what kind of relationship the edge represents.

Takeaway.

The takeaway there are many spokes on the cultural wheel, in a business today, encompassing business acumen, technology acumen and information relationships and raw data knowledge and while they are all equally critical to success, the absolute critical step is that the logical business model defined as the Information Value Chain is maintained and enhanced.

It is a given that all business desire to lower cost and gain insight into information, it is imperative that a business maintain and improve their ability to provide accurate information that can be audited and traceable and navigate the Information Value Chain Data Science can only be achieved after a business fully understand their existing Information Architecture and strive to maintain it.

Note as I stated above an Information Architecture is not your Enterprise Architecture. Information architecture is the structural design of shared information environments; the art and science of organizing and labelling websites, intranets, online communities and software to support usability and findability; and an emerging community of practice focused on bringing principles of design, architecture and information science to the digital landscape. Typically, it involves a model or concept of information that is used and applied to activities which require explicit details of complex information systems.

In essence, a business needs a Rosetta stone in order translate past, current and future results.

In future articles we’re going to explore and dive into how these new technologies can be utilized and more importantly how they relate to all the technologies.

Merry Christmas Data Classification, Feature Engineering , Data Governance. ‘How to’ do it and some code take a look

Leave a reply

I was heavily involved in business intelligence, data warehousing and data governance as of several years ago and recently have had many chaotic personal challenges, upon returning to professional practice I have discovered things have not changed that much in 10 yearsagovernance The methodologies and approaches are still relatively consistent however the tools and techniques have changed and In my opinion not for the better, without focusing on specific tools I’ve observed that the core to data or MDM is enabling and providing a capability for classifying data into business categories or nomenclature.. and it has really not improved.

This basic traditional approach has not changed, in essence man AI model predicst a Metric and is wholly based on the integrity of its features or Dimensions.

Therefore I decided, to update some of the techniques and code patterns, I’ve used in the past regarding the information value chain and or record linkage , and we are going to make the results available with associated business and code examples initially with SQL Server and data bricks plus python

My good friend, Jordan Martz of DataMartz fame has greatly contrinuted to this old mans BigData enlightenment as well as Craig Campbell in updating some of the basic classification capabilities required and critical for data governance. If you would like a more detailed version of the source as well as the test data, please send me an email at iwhiteside@msn.com. Stay tuned for more update and soon we will add Neural Network capability for additional automation of “Governance Type” automated classification and confidence monitoring.

Before we focus on functionality let’s focus on methodology

Initially understand key metrics to be measured/KPI‘s their formulas and of course teh businesse’s expectation of their calculations

Immediately gather file sources and complete profiling as specified in my original article found here

Implementing the processes in my meta-data mart article would provide numerous statistics regarding integers or float field however there are some special considerations for text fields or smart codes

Before beginning classification you would employ similarity matching or fuzzy matching as described here

As I said I posted the code for this process on SQL Server Central 10 years ago here is s Python Version.

databricks-logo Roll You Own – Python Jaro_Winkler(Python)

databricks-logoroll You Own – Python Jaro_Winkler(Python)
Import Notebook
Step 1a – import pandas
import pandas

Step 2 – Import Libraries

libraries from pyspark.sql.functions import input_file_name

from pyspark.sql.types import *
import datetime, time, re, os, pandas

ML Libraires

from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, NGram, HashingTF, IDF, Word2Vec, Normalizer, Imputer, VectorAssembler
from pyspark.ml import Pipeline

import mlflow

from mlflow.tracking import MLFlowClient

from sklearn.cluster import KMeans
import numpy as np
Step 3 – Test JaroWinkler

JaroWinkler(‘TRAC’,’TRACE’)
Out[5]: 0.933333
Step 4a =Implement JaroWinkler(Fuzzy Matching)

%python

def JaroWinkler(str1_in, str2_in):

if(str1_in is None or str2_in is None):
return 0

tr=0
common=0
jaro_value=0
len_str1=len(str1_in)
len_str2=len(str2_in)
column_names=[‘FID’,’FStatus’]

df_temp_table1 = pandas.DataFrame(columns=column_names)
df_temp_table2 = pandas.DataFrame(columns=column_names)

#clean_string(str1_in)
#clean_string(str2_in)

if len_str1 > len_str2:
swap_len=len_str2
len_str2=len_str1
len_str1=swap_len
swap_str=str1_in
str1_in=str2_in
str2_in=swap_str

max_len=len_str2

iCounter=1

while(iCounter <= len_str1):
df=pandas.DataFrame([[iCounter,0]], columns=column_names)
df_temp_table1=pandas.concat([df_temp_table1,df], ignore_index=True)
iCounter=iCounter+1

iCounter=1

while (iCounter <= len_str2):
df=pandas.DataFrame([[iCounter,0]], columns=column_names)
df_temp_table2=pandas.concat([df_temp_table2,df], ignore_index=True)
iCounter=iCounter+1

iCounter=1
m=round((max_len/2)-1,0)
i=1

while(i <= len_str1):
a1=str1_in[i-1]

if m >= i:
  f=1
  z=i+m 
else:
  f=i-m
  z=i+m 
if z > max_len:
  z=max_len
while (f <= z):
  a2=str2_in[int(f-1)]
  if(a2==a1 and df_temp_table2.loc[f-1].at['FStatus']==0):
    common = common + 1
    df_temp_table1.at[i-1,'FStatus']=1
    df_temp_table2.at[f-1,'FStatus']=1 
    break
  f=f+1
i=i+1

i=1
z=1
while(i <= len_str1): v1Status=df_temp_table1.loc[i-1].at[‘FStatus’] if(v1Status==1): while(z <= len_str2): v2Status=df_temp_table2.loc[z-1].at[‘FStatus’] if(v2Status==1): a1=str1_in[i-1] a2=str2_in[z-1] z=z+1 if(a1 != a2 ): tr = tr+0.5 break break i=i+1 wcd = 1.0/3.0 wrd = 1.0/3.0 wtr = 1.0/3.0 if (common != 0): jaro_value = (wcd * common)/ len_str1 + (wrd * common) / len_str2 + (wtr * (common – tr)) / common return round(jaro_value,6) Step 4b – Register JaroWinkler spark.udf.register(“JaroWinkler”, JaroWinkler) Out[6]:
Step8a – Bridge vs Master vs AssociativeALL

%sql
DROP TABLE IF EXISTS NameAssociative;
CREATE TABLE NameAssociative;

SELECT
Name
,NameInput

,sha2(replace ( NameLookput,’%[^a-Z0-9, ]%’,’ ‘) , 256) as NameLookupCeaned ,a.NameLookupKey
,sha2(replace( NameInput,’%[^a-Z0-9, ]%’,’ ‘) , 256) as NameInput,b.NameInputKey
,JaroWinkler(a.NameLookup, b.NameInput) MatchScor
,RANK() OVER (Partition by a.DetailedBUMaster ORDER BY JaroWinkler(a.NameLookupCleande, b.NameInputCleaned) DESC) NameLookup,b.NameLookupKey
FROM NameInput as a
CROSS JOIN NameLookup as b

#truthtopower

Leave a reply

www.youtube.com/watch

Chocolate cake, MDM, data quality, machine learning and creating the information value chain’

Leave a reply

The primary take away from this article will be that you don’t start your Machine Learning project, MDM , Data Quality or Analytical project with “data” analysis, you start with the end in mind, the business objective in mind. We don’t need to analyze data to know what it is, it’s like oil or water or sand or flour.

Unless we have a business purpose to use these things, we don’t need to analyze them to know what they are. Then because they are only ingredients to whatever we’re trying to make. And what makes them important is to what degree they are part of the recipe , how they are associated

Business Objective: Make Desert

Business Questions: The consensus is Chocolate Cake , how do we make it?

Business Metrics: Baked Chocolate Cake

Metric Decomposition: What are the ingredients and portions?

2/3 cup butter, softened

1-2/3 cups sugar

3 large eggs

2 cups all-purpose flour

2/3 cup baking cocoa

1-1/4 teaspoons baking soda

1 teaspoon salt

1-1/3 cups milk

Confectioners’ sugar or favorite frosting

So here is the point you don’t start to figure out what you’re going to have for dessert by analyzing the quality of the ingredients. It’s not important until you put them in the context of what you’re making and how they relate in essence, or how the ingredients are linked or they are chained together.

In relation to my example of desert and a chocolate cake, an example could be, that you only have one cup of sugar, the eggs could’ve set out on the counter all day, the flour could be coconut flour , etc. etc. you make your judgment on whether not to make the cake on the basis of analyzing all the ingredients in the context of what you want to, which is a chocolate cake made with possibly warm eggs, cocunut flour and only one cup of sugar.

Again belaboring this you don’t start you project by looking at a single entity column or piece of data, until you know what you’re going to use it for in the context of meeting your business objectives.

Applying this to the area of machine learning, data quality and/or MDM lets take an example as follows:

Business Objective: Determine Operating Income

Business Questions: How much do we make, what does it cost us.

Business. Metrics: Operating income = gross income – operating expenses – depreciation – amortization.

Metric Decomposition: What do I need to determine a Operating income?

Gross Income = Sales Amount from Sales Table, Product, Address

Operating Expense = Cost from Expense Table, Department, Vendor

Etc…

Dimensions to Analyze for quality.

Product

Address

Department

Vendor

You may think these are the ingredients for our chocolate cake in regards to business and operating income however we’re missing one key component, the portions or relationship, in business, this would mean the association,hierarchy or drill path that the business will follow when asking a question such as why is our operating income low?

For instance the CEO might first ask what area of the country are we making the least amount of money?

After that the CEO may ask well in that part of the country, what product is making the least amount of money and who manages it, what about the parts suppliers?

Product => Address => Department => Vendor

Product => Department => Vendor => Address

Many times these hierarchies, drill downs, associations or relationships are based on various legal transaction of related data elements the company requires either between their customers and or vendors.

The point here is we need to know the relationships , dependencies and associations that are required for each business legal transaction we’re going to have to build in order to link these elements directly to the metrics that are required for determining operating income, and subsequently answering questions about it.

No matter the project, whether we are preparing for developing a machine learning model, building an MDM application or providing an analytical application if we cannot provide these elements and their associations to a metric , we will not have answered the key business questions and will most likely fail.

The need to resolve the relationships is what drives the need for data quality which is really a way of understanding what you need to do to standardize your data. Because the only way to create the relationships is with standards and mappings between entities.

The key is mastering and linking relationships or associations required for answering business questions, it is certainly not just mastering “data” with out context.

We need MASTER DATA RELATIONSHIP MANAGEMENT

not

MASTER DATA MANAGEMENT.

So final thoughts are the key to making the chocolate cake is understanding the relationships and the relative importance of the data/ingredients to each other not the individual quality of each ingredient.

This also affects the workflow, Many inexperienced MDM Data architects do not realize that these associations form the basis for the fact tables in the analytical area. These associations will be the primary path(work flow) the data stewards will follow in performing maintenance , the stewards will be guided based on these associations to maintain the surrounding dimensions/master entities. Unfortunately instead some architects will focus on the technology and not the business. Virtually all MDM tools are model driven APIs and rely on these relationships(hierarchies) to generate work flow and maintenance screen generation. Many inexperienced architects focus on MVP(Minimal Viable Product), or technical short term deliverable and are quickly called to task due to the fact the incurred cost for the business is not lowered as well as the final product(Chocolate Cake) is delayed and will now cost more.

Unless the specifics of questionable quality in a specific entity or table or understood in the context of the greater business question and association it cannot be excluded are included.

An excellent resource for understanding this context can we found by following: John Owens

Final , final thoughts, there is an emphasis on creating the MVP(Minimal Viable Product) in projects today, my take is in the real world you need to deliver the chocolate cake, simply delivering the cake with no frosting will not do,in reality the client wants to “have their cake and eat it too”.

Note:

Operating Income is a synonym for earnings before interest and taxes (EBIT) and is also referred to as “operating profit” or “recurring profit.” Operating income is calculated as: Operating income = gross income – operating expenses – depreciation – amortization.

DNA and the concept of MDM(Master Data Management) have many similarities.

1 Reply

DNA and the concept of MDM( Master Data Management) or Modern ML/AI Data Preparation have many similarities.

“There is a subtle difference between data and information.”
We in IT have complicated and diluted the concept and process of analyzing data and business metrics incredibly in the last few decades. We seem to be focus in on the word data.

And if you consider the primary business objectives of MDM is provide consistent answers with standard business definitions and an understanding of the relationship or mappings of business outcomes to data elements.

DNA vs MDM or IT’s version of DNA.

The graphic I’ve chosen for this post symbolizes the linkage or lineage of human beings to DNA. What I’d like to do is relate the importance of lineage of data to examples of human how DNA communicate lineage and then discuss it in our IT as it relates to business functionality

Living Organisms are very complex as is a company and its data or information.

The genetic information of every living organism is stored inside these nucleic the basic data.

There are two types of nucleic acids(NA) namely:

DNA– Defines Traits, Characteristics

RNA – Communication, transfers information and synthesis.

Lets examine them.

DNA- Defines Traits, Characteristics

DNA-Deoxyribonucleic acid – In most living organisms (except for viruses), genetic information is stored in the form of DNA.

RNA – Communication, transfers information and synthesis

RNA – can move around in the cells of living organisms and serves as a genetic messenger, passing the information stored in the cell’s DNA from the nucleus to other parts of the cell for protein synthesis.

So here goes this is a bit of a stretch but if you consider DNA is the “Data” and a person the “Information” is created from the communication through the RNA.

To continue the analogy the DNA or chromosomes in and by themselves are out of context. It’s only once they been passed from one person to the next driven by RNA and result in a human being in that they become contextually realized as a human.

Again if we break down human DNA and inspected it, we can tell many things origin, or ancestory , traits of the person possible, diseases of the person, but it needs to be processed for us to understand the actual person.

So my point in relation to various approaches to traditional MDM or Master Data Management yis that this(DNA) is how life is created, it’s science, it’s not a methodology or product in approach a vendor guess work it’s real and the main point is “lineage” is the key

“There is a subtle difference between data and information. Data are the facts or details from which is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context.”

Examples of Data and Information as it relates to MDM

The history of temperature readings all over the world for the past 100 years is data.

If this data is organized and analyzed to find that global temperature is rising, then that is information.

The number of visitors to a website by country is an example of data.
Finding out that traffic from the U.S. is increasing while that from Australia is decreasing is meaningful information.
Often data is required to back up a claim or conclusion (information) derived or deduced from it.
For example, before a drug is approved by the FDA, the manufacturer must conduct clinical trials and present a lot of data to demonstrate that the drug is safe.

“Mislesading” Data”

Data needs to be interpreted and analyzed, it is quite possible — indeed, very probable — that it will be interpreted incorrectly.

When this leads to erroneous conclusions, it is said that the data are misleading. Often this is the result of incomplete data or a lack of context.

Comparison charts

“Synthesis: the combining of the constituent elements of separate material or abstract entities into a single or unified entity ( opposed to analysis, ) the separating of any material or abstract entity into its constituent elements.”

Synthesis in MDM

Communication, transfers information and synthesis, like RNA several companies are prevelant in terms of data movement and replication, in essence data logistics

Defines Traits, Characteristics like DNA many companies have developed and refined products and the techniques required for during these. They are data profiling , domain pattern profiling and record linkage, the basis of transforming data into information .

However lineage is the key, and without this to serve as a connection between data and information, in essence there is no information. And in this case the “information” is the business term from the business glossary.

The integration of data movement capability and the linking of data profiling capabilities can result in providing a business the capabilities answering business question with certitude through the transparency of lineage.

To translate this into “business terms” it’s very similar to providing an audit trail in that a business can ask this questions like what customers are the most profitable look at those customers and then drill into wood products what areas with those profits coming

What lineage accomplishes is to lay a trail of cookie crumbs for data movement but for your business questions, it’s simply makes sure that you can connect the dots as data gets moved and/or translated and/or standardize and or cleansed throughout your enterprise.

And with the simple action of linking data file metadata of files, columns , profiling results to a businesses glossary or business terms, will result in deeply insightful and informative business insight and analysis.

“Analysis the separating of any material or abstract entity into its constituent elements”

In order for a business manager for analyze you need to be able to start the analysis at a understandable business terminology level. And then provide the manager with the ability to decompose or understand lineage from a logical perspective.

There are three essential capabilities required for analysis and utilizing lineage to answer business questions via a meta-data mart and these are very similar to the pattern that exist in DNA.

1. Data profiling and domain(column) analysis as well as fuzzy matching processes that are available in many forms:

a. – Scan all the values within each column and provide statistics(counts) such as minimum value, maximin value, mode(most occuring) value, number of missing value etc…

b. Frequency or Column value Patterns – determine the counts of distinct values within a column and also identify the distinct pattern occurring for all values within a single column and associated counts ie(SSN = 123-45-6789 , SSN PATTERN = ‘999-99-9999’

c. Fuzzy Matching( similarity algorithms) – This capability enables the find the counts of duplicate or similar text values.

2. The results need to be stored is a “Metadata-mart” in order to see the patterns, results and associations providing lineage and retatingraw data to business terms and hierarchies.

3. Visualization and analysis capability to allow for analysis, drill down into data mart aggregated and statistical contents and associated business hierarchies and businessterminology

Underlying each of these analytical capabilities is a set of refined processes, developed and proven code for accomplishing these basic fundamental task.

In future post I will describe how to implement these capabilities with or without vendor products, from a logical perspective.

References:

http://www.diffen.com/difference/Data_vs_Information

Ira Warren Whiteside's Blog- Information Sherpa

Perception is Perception “Awareness is Reality”

Category Archives: data quality

The Architecture of Resilience: Navigating Metabolic and Digital Transformation

From Querying Rows to Querying Reason: 5 Surprising Ways AI is Redefining the Database Professional

The $350k Transition: 5 Surprising Realities of Becoming an AI Engineer

Beyond the Hype: 5 Surprising Realities of the Modern LLM Frontier

Evolving Data Into Information through Lineage

ORIGINS
COMMON SENSE

Information vs data

Examples of Data and Information

“Misleading” Data

Comparison chart

Synthesis

Lineage is the key.

One case study

Now for the results.

Data Governance Navigating the Information Value Chain Demystifying the path forward

Navigation

Information Value Chain

Demystifying the path forward.

History

1970

2020

Nodes and Edges

Takeaway.

Merry Christmas Data Classification, Feature Engineering , Data Governance. ‘How to’ do it and some code take a look

libraries from pyspark.sql.functions import input_file_name

ML Libraires

import mlflow

from mlflow.tracking import MLFlowClient

#truthtopower

Chocolate cake, MDM, data quality, machine learning and creating the information value chain’

DNA and the concept of MDM(Master Data Management) have many similarities.