
Providing Operational Analytics – Microsoft Analysis Services / Informatica / QlikView

 

Real Time Update Scenarios

Ira Warren Whiteside

Recently a client described a need for real-time (or near-real-time) movement of data from their operational logistics application to an analytics/reporting visualization environment where presentation tools such as Microsoft Reporting Services, Business Objects or QlikView consume the data. First we will focus on the “Real Time” update scenarios; presentation can be addressed in a separate document. Based on the discussion we noted three possible approaches to solving this problem.

1. Replicate Tables using SQL Server Replication and use Business Objects directly against the OLTP.

2. Utilize a real time update process (Informatica/SQL Server) to incrementally update tables and use Business Objects directly against the OLTP.

3. Use an OLAP or memory-based approach, using various compression techniques, eliminating the need for additional relational tables and SQL-based query tools.

In our discussion several vendors were mentioned:

1. Informatica, via the Informatica PowerExchange™ MS SQL Server Real-Time Option, for updating relational tables. This is a change data capture (CDC) tool.

2. QlikView – A memory-based reporting and querying tool.

3. SQL Server 2005 – Microsoft DBMS, OLAP engine and associated applications (SSIS – SQL Server Integration Services, SSAS – SQL Server Analysis Services).

There are other solutions that are not relationally based, such as Netezza or Sand Technology. I would suggest a brief review of Gartner’s current analysis as it relates to real-time updates.

I would like to add the observation that the transfer problem is logically separate from the analysis/visualization problem. The three tools discussed all approach the problem completely differently. Only QlikView attempts to present a solution that combines the two.

There are two approaches to near-real time analytics. They are as follows:

  • Operational Analysis based on loading the source data into compressed proprietary data stores without transformations (cleansing, normalization, hierarchies). This will provide analytical capabilities, including the metrics, relationships and hierarchies that exist in the data model being sourced.
    • The limitations here are no trending, history or complex business rules, and no metrics that require corporate reference data such as product or organization hierarchies.
    • A benefit in relation to the extraction process is that this approach does not require a staging area.
  • Traditional Business Intelligence is based on a metadata-driven approach where the source data is transformed to properly analyze a specific set of business metrics and their associated business process hierarchies, including trending and history.
    • A limitation is that this approach requires a more complex extraction and loading process and a staging area.
    • A major benefit is that this approach “insulates” the reporting or analytical layer from any changes or additions to the source. This is accomplished by using a data-driven approach and creating a business-oriented dimensional semantic layer. In most cases the metrics and dimensions match the business processes and do not change over time, whereas the source data and nomenclature are volatile.

I conducted a brief analysis of the vendors mentioned and have the following observations:

  • Informatica – Their “Real Time” offering for Microsoft relies on SQL Server 2005 Replication and Publishing and concentrates on providing CDC (Change Data Capture) technology for applications without date/time stamps or update flags.

    • This would require replicating the existing database model and reporting directly against it, or developing a dimensional model.


  • QlikView – This is a memory-based tool requiring various QlikView servers for processing and for publishing QlikView documents through the QlikView Publisher. QlikView has a wizard to create load and transformation scripts. It assumes relationships based on identically named fields; in most cases, per the vendor’s recommendation, these need to be manually modified.


  • SQL Server 2005 Analysis Services is similar to QlikView in architecture. The specific “real time” features are Snapshot Isolation and Proactive Caching. Snapshot Isolation ensures the data extracted is kept in sync by only pulling data available when the first table is queried. Analysis Services provides high query performance by maintaining a cache of the underlying data in a form optimized for analytical queries, and the Analysis Services 2005 proactive caching feature provides a way for the cube designer to define policies that balance the business need for performance against the need for real-time data. The system runs on “auto pilot,” without the need for explicit processing.

Different policies allow for quite different behaviors. For example:

      • Automatic MOLAP. The MOLAP cache is refreshed automatically as changes occur in the source data, providing MOLAP query performance with near real-time data.
      • Scheduled MOLAP. The cube is simply refreshed periodically, irrespective of updates to the source data.
      • Incremental MOLAP. The cube is refreshed incrementally with new data in the source database.
      • Real-time ROLAP. The cube is always in ROLAP mode, and all updates to the source data are immediately reflected in the query results.

The proactive caching settings can be set on each Analysis Services object (dimension and partition). This provides great flexibility in cases where the policies and the tradeoffs between performance and latency vary for different parts of the cube.


Microsoft cubes, in conjunction with SSIS (pipeline architecture) and the UDM (Unified Dimensional Model), do not require any staging.


Recommendations: I would recommend a POC (Proof of Concept) and the creation of two scenarios, focusing on the two OLAP-type options: QlikView and SQL Server 2005 Analysis Services. As part of the scenarios I suggest:

  • Validating existing business requirements including metric formulas and required hierarchies
  • Defining the metadata driven Extraction Architecture (independently of tool).

In reviewing the two tools it is apparent the extraction processes would be similar and SQL based, but would rely on different scheduling processes. It also seems that using views against the source database would be rejected out of hand due to performance. A best-practices Extraction Architecture would require a metadata-driven application architecture that is aware of structure or content (domain) changes and automatically adjusts load processes accordingly. Each solution would probably require replication. The focus of completing the scenarios would be to evaluate the gaps each of the recommended scenarios/tools has against the Extraction Architecture requirements, and the effort and skill set required to complete the POC.

  • Scalability
    • For QlikView you would need to manually design your queries. In their documentation they mention loading 4 million rows per minute; without knowing the number of columns or the hardware configuration, this is not very informative.
    • MSAS – Using the UDM and the distributed query capability, MSAS is capable of 120,000 rows per second, or roughly 7 million rows per minute, on a dual-core Xeon (EM64T). [1]
  • Testing Approach – I would require full-volume testing based on several documented update scenarios and the full spectrum of user queries, including dynamic queries, to determine whether you will need joins in QlikView or hierarchies in MSAS.
  • Source DB impact – We would need to inform and involve the Source DB team in our Extraction Architecture and work toward direct access to their tables, most likely in a replicated environment. Each tool (MSAS, QlikView) recommends extracting only the necessary attributes and rows and requires manually adjusting identically named fields and joins.[2]
  • Team Structure – I would structure the team around the Extraction Architecture with individuals focusing on Source Profiling, Source Extraction, Load Scripts/Design and finally presentation.

[1] Microsoft SQL Server 2005 Analysis Services Performance Guide

[2] Actually, QlikView requires significant manual coding for loading. Specifically, QlikView assumes all identically named fields are keys and attempts to create joins; these must be hand-coded to include only the required fields (see FAQ 7 in their documentation).

Roll your own Data Profiling – Column Functional Dependency

By Ira Warren Whiteside

 


Actuality Business Intelligence

 

 

A while back I was reading one of Jamie Thomson’s most excellent posts on the Conchango SSIS Junkie blog, SSIS: Data Profiling Task: Part 7 – Functional Dependency, and it occurred to me that it would be interesting to provide a poor man’s Functional Dependency using TSQL.

 

First let’s define Functional Dependency.

Functional Dependency

Determines the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column). This profile also helps you identify problems in your data, such as values that are not valid. For example, you profile the dependency between a column that contains States/Provinces and a column that contains Country or Region Codes. The same State/Province should always have the same Country or Region Code, but the profile discovers violations of this dependency.
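
As a quick illustration of the definition, a minimal static check can be written with a simple GROUP BY. This is my own sketch, and it assumes the Addresses test table built later in this post, with the state/province as the determinant and the country/region code as the dependent:

-- Flag any state/province (determinant) that maps to more than one
-- country/region code (dependent); each such state indicates a dependency violation.
SELECT StateProvinceName,
       COUNT(DISTINCT CountryRegionCode) AS DistinctCountryCodes
FROM   dbo.Addresses
GROUP BY StateProvinceName
HAVING COUNT(DISTINCT CountryRegionCode) > 1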

 

Now let’s take a crack at writing our own. Naturally, when facing a difficult problem in TSQL, I turned to SQL Server Central’s vast library.

 

When analyzing the capabilities needed to determine one column’s dependency on another, you would obviously need recursion or the ability to link one row to the next. A brief search of existing articles on SQL Server Central returned Linking to the Previous Row by David McKinney, 2008/03/13, and Common Table Expressions in SQL Server 2005 by Srinivas Sampath, 2007/03/09.

 

Armed with these two articles I now had all the techniques I needed to develop a query or stored procedure to provide Column Functional Dependency.

 

TSQL Functional Dependency

The heart of this effort is to utilize Common Table Expressions (CTEs), or the WITH clause, defined on MSDN as:

 

“Specifies a temporary named result set, known as a common table expression (CTE). This is derived from a simple query and defined within the execution scope of a single SELECT, INSERT, UPDATE, or DELETE statement. This clause can also be used in a CREATE VIEW statement as part of its defining SELECT statement. A common table expression can include references to itself. This is referred to as a recursive common table expression.”

 

Basically, the WITH clause provides you with the ability to create CTEs, or named result sets, logically develop your code, and employ recursive techniques.
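
For example, here is a minimal WITH-clause sketch of my own against the AdventureWorks Person.StateProvince table; the CTE simply acts as a named result set for the SELECT that follows it:

-- A CTE named CountryStates, then an ordinary query against it.
;WITH CountryStates AS
(
    SELECT CountryRegionCode, Name AS StateProvinceName
    FROM   Person.StateProvince
)
SELECT CountryRegionCode, COUNT(*) AS StateCount
FROM   CountryStates
GROUP BY CountryRegionCode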

 

A detailed treatment of CTEs is available in Common Table Expressions in SQL Server 2005 by Srinivas Sampath, 2007/03/09.

 

Setup Test Data

I have used the same test setup that Jamie Thomson provides for testing the SQL Server 2008 Data Profiling Task.

 

 

-- Build the Addresses test table from AdventureWorks and seed one
-- functional-dependency violation per state/province: the first row of each
-- state (by postal code) gets a bogus country code of 'ZZ'.
select ROW_NUMBER() OVER (partition by sp.Name order by a.PostalCode) RowNum
     , a.PostalCode
     , sp.Name AS StateProvinceName
     , sp.CountryRegionCode
into Addresses
from Person.[Address] a
inner join Person.StateProvince sp
    on a.StateProvinceID = sp.StateProvinceID

update Addresses
set CountryRegionCode = 'ZZ'
where RowNum = 1

 

 

 

This will create a test table called Addresses, which contains a Functional Dependency scenario for States/Provinces and Countries.
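
Before generalizing the comparison into a stored procedure, here is a hard-coded sketch (my own, not part of the original scripts) of the prior-row linking pattern against Addresses; the dynamic procedure below generates essentially this statement:

-- Number the rows within each state/province, then join each row to the
-- previous row in the same state and flag a violation when the country code
-- changes.
;WITH DependencyCompare AS
(
    SELECT StateProvinceName,
           CountryRegionCode,
           ROW_NUMBER() OVER (PARTITION BY StateProvinceName
                              ORDER BY CountryRegionCode) AS rownumcalc
    FROM   dbo.Addresses
)
SELECT currow.StateProvinceName,
       currow.CountryRegionCode,
       prevrow.CountryRegionCode AS PrevCountryRegionCode,
       CASE WHEN prevrow.CountryRegionCode IS NULL
                 OR prevrow.CountryRegionCode = currow.CountryRegionCode
            THEN 0 ELSE 1
       END AS ViolationCount
FROM DependencyCompare currow
LEFT JOIN DependencyCompare prevrow
       ON currow.rownumcalc = prevrow.rownumcalc + 1
      AND currow.StateProvinceName = prevrow.StateProvinceName
ORDER BY currow.StateProvinceName, currow.rownumcalc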

 

 

[Screenshot: sample rows from the Addresses test table]

 

 

 

 

 

Example Functional Dependency Stored Procedure

 

 

USE [AdventureWorks]
GO

/****** Object: Table [dbo].[DependencyColumnTable] ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[DependencyColumnTable]') AND type in (N'U'))
    DROP TABLE [dbo].[DependencyColumnTable]
GO

/****** Object: StoredProcedure [dbo].[ColumnDependency] ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependency]') AND type in (N'P', N'PC'))
    DROP PROCEDURE [dbo].[ColumnDependency]
GO

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

-- Builds a dynamic query that links each row to its previous and next row
-- (within the same @Dependent value) and writes the comparison, including a
-- ViolationCount flag, into DependencyColumnTable.
-- Note: the generated SELECT aliases the @Dependent column as curDet and the
-- @Determinant column as curDep, which is why the state/province appears as
-- the determinant in the result screenshots later in this post.
CREATE PROCEDURE [dbo].[ColumnDependency]
    @TableName   VARCHAR(128),
    @Determinant VARCHAR(128),
    @Dependent   VARCHAR(128)
AS
Declare @SQL VarChar(MAX)

SELECT @SQL = ' ;WITH DependencyCompare AS ('
SELECT @SQL = @SQL + ' SELECT '
SELECT @SQL = @SQL + @Determinant + ' ,' + @Dependent + ', '
SELECT @SQL = @SQL + 'ROW_NUMBER() OVER (Partition BY ' + @Dependent + ' ORDER BY ' + @Determinant + ' ) AS rownumcalc FROM '
SELECT @SQL = @SQL + @TableName
SELECT @SQL = @SQL + ' ) SELECT
    currow.' + @Dependent + ' AS curDet,
    prevrow.' + @Dependent + ' AS prevDet,
    nextrow.' + @Dependent + ' AS nextDet,
    currow.' + @Determinant + ' AS curDep,
    prevrow.' + @Determinant + ' AS prevDep,
    nextrow.' + @Determinant + ' AS nextDep,
    currow.rownumcalc ,
    ViolationCount =
    Case
        When currow.' + @Determinant + ' = prevrow.' + @Determinant +
        ' and currow.' + @Dependent + ' = prevrow.' + @Dependent +
        ' then 0
        When prevrow.' + @Dependent + ' is null
        then 0
        Else 1
    End'
    + ' INTO DependencyColumnTable
    FROM DependencyCompare currow '
    + ' LEFT JOIN DependencyCompare nextrow ON currow.rownumcalc = nextrow.rownumcalc - 1 AND currow.'
    + @Dependent + ' = nextrow.' + @Dependent
    + ' LEFT JOIN DependencyCompare prevrow ON currow.rownumcalc = prevrow.rownumcalc + 1 AND currow.'
    + @Dependent + ' = prevrow.' + @Dependent
    + ' Order by currow.' + @Dependent

Print @SQL
Exec (@SQL)
GO

-- Sample call against the Addresses test table
Exec ColumnDependency
    @TableName   = 'Addresses',
    @Dependent   = 'StateProvinceName',
    @Determinant = 'CountryRegionCode'

 

 

The code here basically leverages CTEs and the prior-row linking technique. Our objective is to compare the current row to the previous row’s columns to determine when a Dependent column value changes while the Determinant column has NOT changed; this is a violation. A detailed explanation of the technique of linking to previous records can be reviewed in the SQL Server Central article Linking to the Previous Row by David McKinney, 2008/03/13. One issue I had was the need to dynamically populate the FROM clause, or table name.

 

I used the technique presented in Introduction to Dynamic SQL (Part 1) by Damian Maclennen, 20 June 2001. This technique relies on creating a @SQL variable and executing that variable as part of the stored procedure.

 

 

 

 

-- Dynamic SQL pattern: build the statement in a variable and execute it with EXEC.
Create Procedure GenericTableSelect
    @TableName VarChar(100)
AS
Declare @SQL VarChar(1000)
SELECT @SQL = 'SELECT * FROM '
SELECT @SQL = @SQL + @TableName
Exec ( @SQL)
GO

 

 

Example Functional Dependency Results

 

 

[Screenshot: Functional Dependency detail results from the DependencyColumnTable table]

 

 

 

As you can see, Alabama is our Determinant value, which normally has an associated Dependent value of US. When the Dependent value changes to ZZ, the row is tagged with a Violation Count of 1.
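
If you want to query the detail results directly rather than rely on the screenshot, a simple inspection query (assuming the DependencyColumnTable created by ColumnDependency) looks like this:

-- Show the comparison rows and violation flags for a single determinant value.
SELECT curDet, curDep, prevDep, rownumcalc, ViolationCount
FROM   dbo.DependencyColumnTable
WHERE  curDet = 'Alabama'
ORDER BY rownumcalc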

 

 

Example Functional Summary

 

 

/*
   This stored procedure creates a summary table (DependencyCompareOLAP) and
   calculates the Support and Violation percentages.
   Sample execution:
       exec ColumnDependencyOLAP
   Note: drop dbo.DependencyCompareOLAP before re-running, since the dynamic
   SELECT ... INTO creates that table.
*/
USE [AdventureWorks]
GO

/****** Object: StoredProcedure [dbo].[ColumnDependencyOLAP] ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependencyOLAP]') AND type in (N'P', N'PC'))
    DROP PROCEDURE [dbo].[ColumnDependencyOLAP]
GO

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[ColumnDependencyOLAP]
AS
Declare @SQL VarChar(MAX)

SELECT @SQL = '
;WITH DependencyCompareDetail AS
(
    Select count(ViolationCount) as Violations , curDet , curDep,
        Case when ViolationCount = 0
             then ''Support Percentage''
             else ''Violation Percentage''
        end as Violation ,
        ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet , curDep) AS rownumcalc
    FROM [AdventureWorks].[dbo].[DependencyColumnTable]
    --Where ViolationCountNull = ''N''
    group by curDet, curDep , ViolationCount
)
,DependencyCompareSummary AS
(
    Select count(ViolationCount) as total , curDet ,
        ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet ) AS rownumcalc
    FROM [AdventureWorks].[dbo].[DependencyColumnTable]
    --Where ViolationCountNull = ''N''
    group by curDet
)
,DependencyCompareOLAP AS
(
    Select
        DependencyCompareDetail.curDet ,
        DependencyCompareDetail.curDep ,
        DependencyCompareDetail.Violation ,
        DependencyCompareDetail.Violations ,
        DependencyCompareSummary.total ,
        ROW_NUMBER() OVER (Partition BY DependencyCompareDetail.curDet ORDER BY DependencyCompareDetail.curDep) AS rownumcalc
    From DependencyCompareDetail
    LEFT JOIN DependencyCompareSummary
        ON DependencyCompareDetail.curDet = DependencyCompareSummary.curDet
)
Select
    curDet as Determinant,
    curDep as Dependent,
    Violation as ''Dependency Type'',
    Violations as ''Dependency Counts'' ,
    total as ''Total Determinant Count'',
    cast(cast(Violations AS Decimal) / cast(total AS Decimal) * 100 as decimal(5, 2)) as PercentViolation
INTO DependencyCompareOLAP
from DependencyCompareOLAP
Order by curDet , curDep '

Print @SQL
Exec (@SQL)
GO

 

 

Now that we have created a table with the Functional Dependency violations identified, we want to create a table that can be used for analysis or reporting, containing the same percentages available in the SSIS 2008 Data Profiling Task Functional Dependency option. Again we have relied on CTEs to create a reusable query that builds a single table with the columns required for analysis, such as the total determinant count and the dependents categorized as supporting dependents or violating dependents.
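
To spot-check a single determinant against the SSIS output shown below, you can query the summary table directly. A small sketch, assuming the DependencyCompareOLAP table created by the procedure above:

-- Support and violation percentages for one determinant value.
SELECT Determinant,
       Dependent,
       [Dependency Type],
       [Total Determinant Count],
       PercentViolation
FROM   dbo.DependencyCompareOLAP
WHERE  Determinant = 'Alabama'
ORDER BY [Dependency Type]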

 

 

Example Functional Summary Results

 

 

 

[Screenshot: Functional Dependency summary results from the DependencyCompareOLAP table]

 

Here are the results from the SSIS 2008 Data Profiling Task for Functional Dependency.

 

[Screenshot: SSIS 2008 Data Profiling Task Functional Dependency results]

 

As you can see the Support Percentages for Alabama are exactly the same.

 

Roll your Own Interactive Scorecard

 

[Screenshot: Excel 2007 pivot table scorecard built from the DependencyCompareOLAP table]

 

I would also add, from an analytical perspective, that you can quickly create a Pivot Table in Microsoft Excel 2007. Simply open Excel 2007 and select the Data tab.

Then select From Other Sources, follow the wizard, and create a connection to the DependencyCompareOLAP table to create the pivot table and chart below. As you can see, this is a quick way to spot outliers like Alabama.

 

[Screenshot: Excel pivot chart highlighting Alabama as an outlier]

 

Or Manitoba in Canada

 

[Screenshot: Excel pivot chart highlighting Manitoba as an outlier]

 

Further Enhancements

 

You can also add additional columns to the table and stored procedures, such as the table name and a date/time stamp, and create a repository that would allow you to track the violations incrementally over time and provide proactive processes to respond to continuing anomalies.
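
A minimal sketch of what such a repository might look like; the table name DependencyViolationHistory and its columns are my own illustration, not part of the original procedures:

-- Repository table: add the source table name and a run timestamp so that
-- violation percentages can be tracked across runs.
CREATE TABLE dbo.DependencyViolationHistory
(
    TableName        VARCHAR(128)  NOT NULL,
    Determinant      VARCHAR(128)  NOT NULL,
    Dependent        VARCHAR(128)  NOT NULL,
    DependencyType   VARCHAR(50)   NOT NULL,
    PercentViolation DECIMAL(5, 2) NOT NULL,
    RunDateTime      DATETIME      NOT NULL DEFAULT (GETDATE())
)
GO

-- Capture the current run from the summary table.
INSERT INTO dbo.DependencyViolationHistory
        (TableName, Determinant, Dependent, DependencyType, PercentViolation)
SELECT  'Addresses', Determinant, Dependent, [Dependency Type], PercentViolation
FROM    dbo.DependencyCompareOLAP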

 

Here are several additional links that were helpful.

 

1. Introduction to Dynamic SQL (Part 1), by Damian Maclennen, 20 June 2001

2. SQL Server 2005 Recursion and WITH Clause, by Robbe D. Morris

3. The SQL WITH Clause (Order Your Thoughts, Reuse Your Code), by Kevin Meade, 2007-06-19