jeudi 9 février 2017

Historization process for datawarehouse

Introduction


Historization is a process of keeping track of data over time,it happen in different place of the data-warehouse. Today we will discuss about the one that is necessary for the audit of the input data inside the warehouse. We will call it the historization of the Staging layer and we will refer it with the HSTG acronym.

Firstly why do we need to do such a process ? We want to track everything that enter in the data-warehouse in order to understand how each data affect the insight the data-warehouse discover. We need to be able to explain figures display in reports to the final user. Source system can experience issue,to maintain the trust of user in the data-warehouse we need to explain impacts of these source system on the reports.


Overview of the process.

Logical point of view 


One of the key point of historical is to maintain an unbreakable timeline. This timeline let you query the table for a key and a date, and it will always answer back with only one row.The result row will  give the information regarding the state of this key at any point in time.The information that come back can be, this key didn't exist at this point in time,but one and only one row is always the result for any existing key.

Any table in the historical layer will have couple of extra-field providing information about history :
  1. Starting point,when did this record start to be able, we will call it META_START_DATETIME
  2. Ending point,when did this record stop to be relevant META_EXPIRY_DATETIME.
  3. Status,( META_FLAG ) this field will tell the status of the record,I for insert ( first insert ), U for Update ( newer version of an existing record ) , D for Delete if this Business Key have been drop in the source.
  4. Active flag to tell which one is the latest row META_ACTIVE_ROW,0 if its inactive 1 if its active.
In order to maintain the timeline for the beginning of time to the end of time two operations need to be perform.
First is to create a dummy row that handle the time when the record was not yet available,
So we will have a record with the exact same data than the first real record but META_START_DATETIME= 1 January 10001 at 00:00:00 am the META_EXPIRY_DATETIME is the date the actual real record become available lets said ( 9 february 2017 ). The OMD_ACTIVE_ROW.

Second the latest record for each key need to have META_EXPIRY_DATETIME set to 9999-12-31 (end of time ).

With this two operations the timeline will never be break, a query like Select * from dbo.customer where customer_key=1 and datetime = ( anydate ) will always result with one and only one  record.

Example of how data change in the HSTG depending of the source system :

Then on the second day when the address change :

On the third day the record get drop on the source system and doesn't exist anymore.



ETL point of view 


Knowing we work with 1:1 relationship between the source system and the Staging the HSTG need to do the historization the same way.As a result each table in the Staging will have a corresponding table in HSTG.


Couple step are required and they are pretty intense in term of computing power:

Full reload of the Stagging table 

Reload all the table that don't hold transaction in the Stagging layer.

Identify non-existing row in the HSTG


The first task is to create a Hash of the full row for every row in the staging, and we will compare to the hash of the full row in hstg ( minus any meta field ) that have the META_ACTIVE_ROW set to 1
We will discard ( but keep record of the number of row ) any hash that already exist.

Identify the row that are a new insert and the row that are an update

For the row who doesn't exist in the HSTG,there is two possibilities :
    1. These row are a newer version of the same concept ( Business Key ) for example Customer 123 change address,we already have row for Customer 123 but the address change. So in this case the Business Key is already existing in the HSTG,these row will be tag with a U inside META_FLAG
    2. The row represent a new concept ( Business Key ) which we don't have already.In this case the Business Key doesn't exist these row will be flag with an I inside the META_FLAG.
In order to deal with the deleted key the approach is going the other way than for the updated and inserted key. We taking all the hash of the HSTG and compare them to the hash of the STG.The one that don't exist are the one that have been deleted.


Conclusion

In this article we saw how we can handle historic data from a logical point of view, in order to prove and backtrack any figure coming out of the data-warehouse. In the next article we will discuss the various way to implement this within Microsoft technology.