Monday 4 September 2017

A more resilient data export approach

Introduction

Business intelligence and the data warehouse are becoming a hub for communicating data to third parties, especially for regulatory and legal purposes, as the data needs to be aggregated before being sent. These exports need to be resilient, in the sense that an incident in the ETL or an unavailability of the data warehouse should not create a gap in the data being sent.

A common approach is to use subscriptions in SSRS or SharePoint, so the business can verify the data and the same report is sent to the third party. Even if some tricks let you trigger the subscription only when the data is ready, this still lacks the ability to catch up on the days the ETL did not run, or on the days a subscription failed to send the data end to end.


Overview of the software


The goal of the software is to be triggered at the end of an ETL job and to create a task for each instance of a subscription, for each day that needs to be run. Each time a failure happens, the catch-up needs to run as if it were running at the time of the failure. As a result, the non-volatility principle of the data warehouse applies to the data exports as well.
The software can send subscriptions via email and FTP.
The software can be found at this address here 


Details of the program behaviour:


For each active subscription, the program creates one task for the current day. In addition, it also creates, for each subscription, a task for every missing day (so if today is the 29th and it has not run for the 26th, 27th and 28th, it creates three more tasks). It also creates a task for each day that failed and did not have a later success.
Each task is executed in parallel, with a maximum degree of parallelism set in the configuration file.
The flow of the program is the following:
  1. Generate the list of tasks
  2. For each task, connect to Reporting Services using the web service, pass the parameters and generate the report
  3. Gather the report in memory and send it by email or FTP
  4. Log the result
The catch-up lists all the missing days and generates the report by passing a date to a report parameter flagged as a CatchUp parameter; this parameter receives the date of the missing/failed day with an offset. This offset is configurable.


Configure the program

Two places let you configure the program.

Application XML configuration


The first one is the SubscriptionController.exe.config file, which lets you configure the connection to the report server. The endpoint address needs to reflect your SSRS server; this endpoint address can be found using the SSRS Configuration Manager.


The connection strings section lets you configure where your subscription database resides. The user that connects will need read and write permissions.

Within the file SubscriptionController.exe.config you will need to change the values for <web_service_ssrs_url>, <host database name>, <sender_email_address (e.g. me@mycorp.com)> and <email host (e.g. smtp.mycorp.com)>.




Database configuration



There are 8 tables:
  • dbo.email - Email address for each recipient
  • dbo.ftp - Host, login and password for the FTP connection
  • dbo.log - Time executed, result, message, time of the report, time of the parameter
  • dbo.parameter - Parameter name, parameter value
  • dbo.report - Report name, report description, report address
  • dbo.subscription - All the subscription information
  • dbo.bridge_ftp_email_subscription - Links a subscription to a list of email or FTP addresses
  • dbo.bridge_parameter_subscription - Links a subscription to a list of parameters

More details on the subscription table :

  • frequency: DAILY or WEEKLY; the weekly one will start on each Monday of the week
  • type: FTP or EMAIL
  • active_flag: lets you deactivate the subscription without dropping it, 1 means active and 0 means deactivated
  • id_report: a FK to the report table
  • type_file_export: possible types are XML, csv, TIFF, PDF, MHTML, Excel, PowerPoint, Word
  • title: the subject of the email in the case of an email subscription, and the body as well
  • flagAddDateEndOfFile: adds the date of the report at the end of the file name
  • DateFormatEndOfile: the date format, for example yyyyMMdd will give a four-digit year, a two-digit month and a two-digit day
  • baseFileName: the name of the file
  • dateAndOfFileOffset: the offset for the date at the end of the file name, for example an offset of -1 will set the date back by one day
  • flagEncodeASCII: will encode the file in ASCII if the flag is 1
  • flagCatchUpLastSuccessfullLoad: if 1, it will catch up on missing or failed days
  • dateCatchUpOffset: the offset of the date passed to the report during a catch-up
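To make the schema concrete, here is a minimal T-SQL sketch of what setting up a daily email subscription could look like. Only the dbo.subscription columns listed above come from this post; the column names of dbo.report, dbo.email and the bridge table, as well as the hard-coded ids, are assumptions for illustration.

-- Hypothetical report and subscription; ids are hard-coded and assume empty tables.
INSERT INTO dbo.report (report_name, report_description, report_address)
VALUES ('DailyTrades', 'Daily trade export', '/Regulatory/DailyTrades');

INSERT INTO dbo.subscription
    (frequency, type, active_flag, id_report, type_file_export, title,
     flagAddDateEndOfFile, DateFormatEndOfile, baseFileName,
     flagCatchUpLastSuccessfullLoad, dateCatchUpOffset)
VALUES
    ('DAILY', 'EMAIL', 1, 1, 'csv', 'Daily trade export',
     1, 'yyyyMMdd', 'daily_trades', 1, 0);

-- Recipient and the bridge row linking it to the subscription
-- (column names here are assumptions).
INSERT INTO dbo.email (email_address) VALUES ('recipient@thirdparty.com');

INSERT INTO dbo.bridge_ftp_email_subscription (id_subscription, id_email)
VALUES (1, 1);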

More details on the log table :

date_execution is the date when the task was executed.
result: FAILED or SUCCESS.
message: lets you know the exception in case of a failure and, in case of success, whether it was a catch-up or not.
id_subscription: FK to the subscription table.
date_report: in the case of a normal run, the same as the execution date. In the case of a catch-up, it is the date that had not been processed (in the earlier example the 26th, 27th and 28th).
param_date: in the case of a catch-up, it is the actual date passed to the report. For example, if the dates to re-process are the 26th, 27th and 28th and the offset is 2, the dates passed to the report parameter will be the 24th, 25th and 26th.
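The "failed without a later success" rule that drives the catch-up (and the fourth panel of the tracking report described below) can be expressed directly against this table. A minimal T-SQL sketch, assuming the column names listed above and that a successful catch-up logs a SUCCESS row with the same id_subscription and date_report:

-- Days that failed and never had a later successful run for the same
-- subscription and report date.
SELECT f.id_subscription, f.date_report, f.message
FROM dbo.log AS f
WHERE f.result = 'FAILED'
  AND NOT EXISTS (
        SELECT 1
        FROM dbo.log AS s
        WHERE s.id_subscription = f.id_subscription
          AND s.date_report     = f.date_report
          AND s.result          = 'SUCCESS'
          AND s.date_execution  > f.date_execution
      );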





Tracking the subscription


The tracking report lets users follow the subscriptions they manage with third parties. This report has 5 parts.
The first two, side by side at the top, are a list of the report parameters that are passed to the report, together with whether they are catch-up parameters. On the right, the list of destinations tells us the list of receivers, FTP or email addresses.

The third panel, "List of subscriptions parameters", gives all the parameters of the subscriptions.
Right after, the fourth panel, "Fail without later success", lists all the instances of a subscription that failed without any later successful catch-up.

Lastly, the fifth panel lists all the successful runs, letting business users know which data has been sent to third parties.







Thursday 9 February 2017

Historization process for the data warehouse

Introduction


Historization is the process of keeping track of data over time; it happens in different places of the data warehouse. Today we will discuss the one that is necessary for auditing the input data of the warehouse. We will call it the historization of the Staging layer, and we will refer to it with the HSTG acronym.

Firstly, why do we need such a process? We want to track everything that enters the data warehouse in order to understand how each piece of data affects the insights the data warehouse produces. We need to be able to explain the figures displayed in reports to the final user. Source systems can experience issues, and to maintain the trust of users in the data warehouse we need to be able to explain the impact of these source systems on the reports.


Overview of the process.

Logical point of view 


One of the key points of historization is to maintain an unbreakable timeline. This timeline lets you query the table for a key and a date, and it will always answer back with only one row. The result row gives the information regarding the state of this key at that point in time. The information that comes back can be that this key did not exist at this point in time, but one and only one row is always the result for any existing key.

Any table in the historization layer will have a couple of extra fields providing information about history:
  1. Starting point: when did this record start to be applicable, we will call it META_START_DATETIME
  2. Ending point: when did this record stop being relevant, META_EXPIRY_DATETIME
  3. Status (META_FLAG): this field tells the status of the record, I for Insert (first insert), U for Update (newer version of an existing record), D for Delete if this business key has been dropped in the source
  4. Active flag, to tell which one is the latest row, META_ACTIVE_ROW: 0 if it is inactive, 1 if it is active
In order to maintain the timeline from the beginning of time to the end of time, two operations need to be performed.
The first is to create a dummy row that covers the time when the record was not yet available.
So we will have a record with the exact same data as the first real record, but with META_START_DATETIME = 1 January 0001 at 00:00:00 (the beginning of time) and META_EXPIRY_DATETIME set to the date the actual real record became available, let's say 9 February 2017. The META_ACTIVE_ROW of this dummy row is set to 0.

The second is that the latest record for each key needs to have META_EXPIRY_DATETIME set to 9999-12-31 (the end of time).

With these two operations the timeline will never be broken: a query like SELECT * FROM dbo.customer WHERE customer_key = 1 at any given date will always return one and only one record.
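Concretely, the point-in-time lookup brackets the requested date with the two meta dates. A minimal T-SQL sketch against the dbo.customer example above (whether the boundaries are inclusive or exclusive is a convention choice, shown here as an assumption):

-- Returns exactly one row: the state of customer 1 as of the requested date.
DECLARE @as_of DATETIME2 = '2017-02-09';

SELECT *
FROM dbo.customer
WHERE customer_key = 1
  AND META_START_DATETIME <= @as_of      -- inclusive start (assumed convention)
  AND META_EXPIRY_DATETIME > @as_of;     -- exclusive end (assumed convention)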

Example of how the data changes in the HSTG depending on the source system:

Then on the second day, when the address changes:

On the third day, the record is dropped in the source system and does not exist anymore.



ETL point of view 


Knowing that we work with a 1:1 relationship between the source system and the Staging layer, the HSTG needs to do the historization the same way. As a result, each table in Staging will have a corresponding table in HSTG.


A couple of steps are required, and they are pretty intense in terms of computing power:

Full reload of the Staging tables

Reload all the tables in the Staging layer that do not hold transactions.

Identify non-existing rows in the HSTG


The first task is to create a hash of the full row for every row in Staging, and compare it to the hash of the full row in HSTG (minus any meta field) for the rows that have META_ACTIVE_ROW set to 1.
We will discard (but keep a record of the number of rows) any hash that already exists.
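A hedged T-SQL sketch of that comparison, using a hypothetical dbo.stg_customer / dbo.hstg_customer pair with columns customer_key, customer_name and address (HASHBYTES over a delimited concatenation is one common way to build the row hash; the real implementation may differ):

-- Staging rows whose full-row hash does not match any active HSTG row.
WITH stg AS (
    SELECT s.*,
           HASHBYTES('SHA2_256',
                     CONCAT(s.customer_key, '|', s.customer_name, '|', s.address)) AS row_hash
    FROM dbo.stg_customer AS s
)
SELECT stg.*
FROM stg
WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.hstg_customer AS h
        WHERE h.META_ACTIVE_ROW = 1
          AND HASHBYTES('SHA2_256',
                        CONCAT(h.customer_key, '|', h.customer_name, '|', h.address)) = stg.row_hash
      );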

Identify the rows that are new inserts and the rows that are updates

For the rows that do not exist in the HSTG, there are two possibilities:
    1. These rows are a newer version of the same concept (business key). For example, customer 123 changes address: we already have rows for customer 123 but the address changed. In this case the business key already exists in the HSTG, and these rows will be tagged with a U inside META_FLAG.
    2. The rows represent a new concept (business key) which we do not have yet. In this case the business key does not exist, and these rows will be flagged with an I inside META_FLAG.
In order to deal with the deleted keys, the approach goes the other way than for the updated and inserted keys. We take all the hashes of the HSTG and compare them to the hashes of the STG. The ones that do not exist are the ones that have been deleted.
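Continuing with the same hypothetical tables, here is a sketch of the U/I classification and of the delete detection. Note that the delete check below compares business keys, which is one way to express the "other direction" comparison described above:

-- U/I classification of the changed or new staging rows.
SELECT s.*,
       CASE WHEN EXISTS (SELECT 1
                         FROM dbo.hstg_customer AS h
                         WHERE h.customer_key = s.customer_key)
            THEN 'U' ELSE 'I' END AS META_FLAG
FROM dbo.stg_customer AS s;

-- Deleted keys: active HSTG keys that no longer appear in staging get a 'D' record.
SELECT h.customer_key
FROM dbo.hstg_customer AS h
WHERE h.META_ACTIVE_ROW = 1
  AND NOT EXISTS (SELECT 1
                  FROM dbo.stg_customer AS s
                  WHERE s.customer_key = h.customer_key);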


Conclusion

In this article we saw how we can handle historical data from a logical point of view, in order to prove and backtrack any figure coming out of the data warehouse. In the next article we will discuss the various ways to implement this with Microsoft technology.

Monday 30 January 2017

Generic flat file import

1) Introduction


Today we will go through a code example of how to integrate various flat files with the same generic SSIS package. Why do such a thing? Why not just build a custom-made package for each file? There are a couple of reasons that justify this approach:
  1. Having only one package for all the data imports is easier to maintain and easier to debug; if there is an issue with one file import, fixing it fixes it for all the flat file imports.
  2. Less time to develop: why develop a hundred packages instead of just one?
  3. Easier to cross-skill among the different members of a BI team: all the file imports are exactly the same, so each member of the team is more likely to be able to debug an issue.
  4. Easier logic from the scheduler's point of view (SQL Agent): all the files execute the same package, but the SSIS environment variable is different to handle the specific part of each file.
  5. Easier to manage different environments: all the changes are handled by SSIS environment variables.
Firstly we will go through the principal components of the package, secondly more code details, and lastly how to deploy the package.

All the files required can be found here: https://bitbucket.org/vdiallonort/generic_file_import

2) Overview of the package


For a given pattern, let's say all the files starting with "Customer", we will loop through all these files, load them and archive them in a separate Archive folder.

All the successes, failures and error messages will be logged in the database for debugging purposes.

3) Details of each steps

1) Log the start of the process 

The first step logs the start of the process. It uses the parameter FILE_TYPE to log what kind of file it is (Customer file, Sales, Invoice...), along with the file name (Customer_20170131.txt). The id of the row in the log table is stored in the variable LOG_ID as a result of the insert.
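As a hedged sketch, that logging insert can be expressed against the log_file_insert table created at the end of this post (the status convention and the exact statement used by the package are assumptions):

-- Log the start of a load and capture the id of the new row, which the package
-- stores in the LOG_ID variable. Values are passed in by the package; they are
-- shown here as T-SQL variables. status = 0 meaning "started" is an assumption.
DECLARE @file_type NVARCHAR(500)  = N'Customer',
        @file_name NVARCHAR(4000) = N'Customer_20170131.txt';

INSERT INTO dbo.log_file_insert (file_type, [file_name], date_process, [status])
VALUES (@file_type, @file_name, SYSDATETIME(), 0);

SELECT CAST(SCOPE_IDENTITY() AS INT) AS LOG_ID;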

2) Insert the data of the CSV file into the table 


This part is where most of the work resides.
First, the task looks at the structure of the file and uses the header to do the matching with the destination table (parameter TABLE_DESTINATION).
Secondly, it creates a rollback object and stores it in an SSIS package variable. The reason for handling the rollback this way is to be able to roll back from C# code after any event. The rollback object is stored in a variable so it can be accessed from anywhere in the package. This gives far more flexibility than the traditional way of starting a flow with a BEGIN TRANSACTION in a SQL task.
Lastly, it inserts the data into the table.
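A minimal sketch of the "create the transaction and store it" part of the script task, mirroring the rollback snippet shown further down (the parameter name CS_destination and the exact variable wiring are assumptions):

// Requires: using System.Data.SqlClient; at the top of the script.
// The parameter name CS_destination is an assumption, it just needs to point
// at the destination database.
string connectionString = Dts.Variables["$Package::CS_destination"].Value.ToString();

SqlConnection connection = new SqlConnection(connectionString);
connection.Open();

// Begin the transaction and stash it in a package variable so any later task
// (commit or rollback) can pick it up.
SqlTransaction tran = connection.BeginTransaction();
Dts.Variables["User::transaction_object"].Value = tran;

Dts.TaskResult = (int)ScriptResults.Success;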

In case of error there is an Event handler attached to this task to log any error.

3) In case of Success Move file to archive


The link between extract_and_load and the Move to archive folder step is conditional on the success of the extract_and_load step. This step is a File System Task that moves the current file to the Archive folder defined by the parameter ARCHIVE_FILE_FOLDER. If the Move file to archive step succeeds, the flow goes to the logging step; if it fails, it goes to the Rollback step (detailed in 3 bis).

3 bis ) In case of failure Rollback

If the extract_and_load step fails, any transaction made to the database (except the one used for logging) will be rolled back.
It is an SSIS Script Task in C# with the following code:
SqlTransaction tran = (SqlTransaction)Dts.Variables["User::transaction_object"].Value;
tran.Rollback();

Dts.TaskResult = (int)ScriptResults.Success;

What it does is read the SSIS variable that holds the transaction and roll back this transaction.

4) Commit the transaction in case of success


If both the loading and the move of the file to the archive folder were successful, the transaction is committed.
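The commit script itself is not shown in this post; a sketch mirroring the rollback task could look like this:

// Read the stored transaction, commit it, then close the connection.
SqlTransaction tran = (SqlTransaction)Dts.Variables["User::transaction_object"].Value;
SqlConnection connection = tran.Connection;   // keep a handle, Connection returns null once committed
tran.Commit();
connection.Close();

Dts.TaskResult = (int)ScriptResults.Success;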

5) Case of failure 


If the Rollback step has been called, then the next step will log a failure of the load for this file and move on to the next one.

Things to remember


There are a couple of interesting things to highlight. First, there is one transaction per file, so if one fails for any reason it does not affect the other ones.
The main idea is to be able to run the package twice without affecting the ETL, so that a file is not inserted twice on the next load when a system error blocks the file transfer (the file, being still in the import folder, would be picked up again on the next load). To achieve that, if the file cannot be moved, the transaction is rolled back.

4) How to deploy


The first step is to deploy a CSV library on the server (and on any workstation that uses the package) in a folder called Library at the root of the C drive, so you should have C:\Library and, under it, the two files CsvHelper.dll and CsvHelper.xml. This library is the one that reads the CSV files.

On machines running Windows 7 or later, the CsvHelper.dll file is flagged as coming from the internet, which causes issues later on. In order to fix it, the flag needs to be removed: go to the file properties and unblock the file.


For the next phase you will need an existing SSIS catalogue; if you do not have one, open Integration Services Catalogs in SSMS and create it.


The next phase is to deploy the project using Visual Studio. First, change the security option to "Do not save sensitive data".

Right-click on the project, click Properties -> Deployment -> Server name, and type the name of the server you want to deploy the project to.

Right-click on the project and hit "Deploy". If you do not already have one, you will have to create a folder under the SSIS catalogue to hold your project.

Now you can find your package in the SSIS catalog architecture.



The next step is to create an SSIS environment (one for each file type) and map each variable inside the SSIS environment to the package parameters. Later on, in the SSIS job, you will be able to choose which SSIS environment you want to use for this specific job.

Then we will create the variables for this environment.

Ten variables need to be created, all with a String type, listed below:


  1. ARCHIVE_FILE_FOLDER - Path of the folder holding the archive of each file
  2. CS_destination - Connection string of the destination database
  3. CS_meta - Connection string of the meta-database that stores the logging information
  4. FILE_PATTERN - File pattern (for example Customer_*.csv will load all the files starting with Customer, followed by any characters and finishing with .csv)
  5. FILE_TYPE - File type, used for logging; it tells what kind of data is stored in this set of files, for example Customer
  6. SCHEMA_DESTINATION - Database schema of the destination table
  7. SOURCE_FILE_FOLDER - Path of the folder holding the files
  8. TABLE_DESTINATION - Name of the destination table
  9. TEXT_QUALIFIER - Text qualifier in the CSV (generally ")
  10. TEXT_SEPARATOR - Text separator in the CSV (generally ,)
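If you prefer to script the environment instead of clicking through SSMS, the SSISDB catalogue exposes stored procedures for this. A partial sketch (folder and environment names are illustrative, and only two of the ten variables are shown):

-- Create the environment and a couple of its variables in the SSISDB catalogue.
USE SSISDB;
GO

EXEC catalog.create_environment
     @folder_name = N'FileImports',
     @environment_name = N'Customer';

EXEC catalog.create_environment_variable
     @folder_name = N'FileImports', @environment_name = N'Customer',
     @variable_name = N'FILE_PATTERN', @data_type = N'String',
     @sensitive = 0, @value = N'Customer_*.csv', @description = N'';

EXEC catalog.create_environment_variable
     @folder_name = N'FileImports', @environment_name = N'Customer',
     @variable_name = N'SOURCE_FILE_FOLDER', @data_type = N'String',
     @sensitive = 0, @value = N'\\fileserver\imports\customer', @description = N'';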

Now we will map the variables of the package to the SSIS environment variables.

Then each parameter of the package needs to be mapped to a variable of the SSIS environment.
So right-click on the package -> Configure -> References (you can have multiple environments for the same package).

Add the SSIS environment to the package.

Then go to Parameters and, for each parameter, click on the three dots.
Select the correct mapping variable.
Apply this to all the parameters of the package.
You can do the same with the connections: if these change between each file, it is worth parameterizing them.

Now create a new job and create a step with the package: select SSIS package in the list of step types, select your package, then go to Configuration and select the corresponding SSIS environment for this file load.

The last step is to create the log table using this script.

CREATE TABLE [dbo].[log_file_insert](
[id] [int] IDENTITY(1,1) NOT NULL,
[file_type] [nvarchar](500) NULL,
[file_name] [nvarchar](4000) NULL,
[date_process] [datetime2](7) NULL,
[status] [int] NULL,
 CONSTRAINT [PK_log_file_insert] PRIMARY KEY CLUSTERED 
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO