mardi 24 novembre 2020

A cost effective approach to build a datawarehouse in Azure Part 1:

How expensive is it to build a Data-warehouse in Azure ? 

Cloud computing provides a different paradigm while budgeting a data warehouse project. We no longer buy server but consume power. It's the same difference than buying a car or renting one by the hour. So how much power will we consume for a given project? How to reduce this consumption? 

This article is feedback after various Azure data warehouse projects. and is a compilation of services I found cost-effective.

It will details a production environment cost only (other environments depend too much of team size, corporate policies,etc..)


First, let's step back and define a warehouse


A data warehouse is a collection of business-oriented data organized per subject. These data are non-volatile and provide a single version of the truth.

Therefore we will need to accomplish three different goals :

1) Gather all meaningful and useful data
2) Store and create a history of these data, so the insight provided by reports can be challenged and audited.
3) Provide an easy way for a business user to self serve their need for analytics and reporting.

This simplified architecture shows how each Azure service interacts:


More information about this kind of architecture can be found here

 In order to provide a meaningful example, we will consider the real-life example of a medium-size company and look at sales ( sales order and invoices ) and  Stock Management. 

The volume of data to deal with is at a high level ( each area combine multiple tables ) :

Business area

Number lines

Size in GB

Products

116 000

0,156

Sales

1,6 million

1,76

Stock

3 million

2


Gather data



One of the cost-effective ways Azure have to get data is Azure Data factory leveraging Blob storage.
00
Azure Data factory can be both very expensive or very cheap depending on how you use it. The combo of dataflow and running the compute IR on Azure (Azure-hosted Integration runtime) could be insanely expensive and end up costing over 9000 USD per month to load and create a history every night just for couple millions of rows. So we will pass this idea, although from an architecture point of view this is very seductive.

Instead, we will use Data factory to load data in blob storage (which it does without breaking the bank) and organize the dataflow. Azure synapse will be used to load the data from the blob, ingest and transform these data.

This operation will cost around 4700 USD per year in our example.

Store and transform data


There are various options in Azure portfolio to store and manipulate data. First and foremost the cheapest way is pure file system storage like Blob storage, however, this is a great way to load data and gather them but fairly limited when it comes to update data and join them without expensive tools. Database on the other hand are fairly good to manipulate, update, join different datasets with a simple language. SQL.

If we look at the panel of Saas tools in the Azure repository one stand out for data warehouse, it's Azure Synapse.
It has the capability to process a large amount of data, is build for ELT scenario using blob storage as a staging area, and allow the use of multiple languages, SQL,U-SQL,Python,C#.

In our example, we use Synapse to load data from an Odata source, land them in blob storage, load them in the historical layer, organize them into Datavault artefact, use a virtual datamart to finally process them into a Tabular model. With a fair bit of margin, this is done within 5 hours every night using the least expensive option of Synapse, DWU 100 . So our ETL let us take advantage of Saas database by starting it at the very beginning of the ETL then pausing it at the very end. Therefore we are paying the database for only 5 hours. The processing power is 1,510 USD per hour therefore the cost will be 2756 USD per year + 1785 USD for 1 TB of storage per year. So all together storing and transforming the data will cost 4540 USD per year.

Provide meaningful data to business users


One way to let business access to their data is to use Azure Analysis services, this provides a model that process and add business rules so the data can be manipulated from différent tools ( Excel,Powerbi etc ..) and remain logical at all time. For example, a stock state will force a user to use a date and it will not sum up across time. Currency conversion could force the business user to choose a target currency etc...
Experience with users doing a lot of analysis with this layer translate into buying a higher level of performance. I do think B level of Azure Analysis services doesn't provide enough power. Also, bellow standard instance partitioning is not available therefore processing large table can be long as it required to reload the full dataset. Leveraging aggregate is only available with Power Bi premium ( premium include Analysis services on dedicated hardware ) but the price starts at 60 K USD per year which is a significant amount of money if the only goal is to improve performance. S1 instance provides enough power for production workload and handles a load for a medium-size company. The cost of this semantic layer is 18 k USD per year.


Product Lifecycle and CI/CD









Azure DevOps is one of the greatest tools for business intelligence in Azure. Business Intelligence is a peculiar project in IT because the data insight you deliver drive needs for more data package and deeper analysis therefore more project and work package. Therefore it's Continuous Integration/ Continuous Development tools will be a great help and Azure devops is doing that though workload management, ticket management, source control, release pipeline. Cost is rock bottom with the first 5 users free and after it's 72 USD per year per user.


Conclusion


This production environment warehouse example is the bare minimum you will need, it doesn't have any Master Data Management for example and will use excel file in blob storage or file share to let business users add mapping or reference information. Adding a good MDM software is a strong asset for any data warehouse project. However, we can have the skeleton of the data warehouse for  27 K USD per year.