1) Introduction
A single generic package to import all your flat files has several advantages:
- Having only one package for all the data imports is easier to maintain: if there is an issue with one file import, fixing it fixes it for all the flat file imports.
- Less time to develop: why develop a hundred packages instead of just one?
- Easier to cross-skill among the members of a BI team: all the file imports work exactly the same way, so each member of the team is more likely to be able to debug an issue.
- Simpler logic from the scheduler's point of view (SQL Agent): every file runs the same package, only the SSIS environment variables differ to handle the specifics of each file.
- Easier to manage different environments: all the changes are handled through SSIS environment variables.
First we will go through the principal components of the package, then the code in more detail, and lastly how to deploy the package.
All the files required can be found here: https://bitbucket.org/vdiallonort/generic_file_import
2) Overview of the package
For a given pattern, let's say all the files starting with "Customer", we will loop through all these files, load them, and archive them in a separate Archive folder.
All the successes, failures, and error messages are logged in the database for debugging purposes.
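Before diving into the SSIS tasks, here is a minimal sketch of the overall flow in C# (the package presumably does this with a Foreach Loop container; the folder paths and pattern below are hypothetical stand-ins for the SOURCE_FILE_FOLDER, ARCHIVE_FILE_FOLDER and FILE_PATTERN parameters described later):

using System;
using System.IO;

class OverviewSketch
{
    static void Main()
    {
        string sourceFolder = @"C:\Import";    // SOURCE_FILE_FOLDER
        string archiveFolder = @"C:\Archive";  // ARCHIVE_FILE_FOLDER
        string pattern = "Customer_*.csv";     // FILE_PATTERN

        foreach (string file in Directory.GetFiles(sourceFolder, pattern))
        {
            // For each file: log the start, load it inside a transaction,
            // move it to the archive folder, then commit (or roll back).
            Console.WriteLine("Processing " + file);
            File.Move(file, Path.Combine(archiveFolder, Path.GetFileName(file)));
        }
    }
}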
3) Details of each step
1) Log the start of the process
The first step logs the start of the process. It uses the FILE_TYPE parameter to record what kind of file it is (Customer file, Sales, Invoice...) as well as the file name (Customer_20170131.txt). The id of the row inserted into the log table is stored in the LOG_ID variable as a result of the insert.
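As an illustration, this is roughly the insert that step performs, sketched here as C# against the log table created in the deployment section (the actual step may well be an Execute SQL Task; metaConnectionString, fileType, fileName and the status value 0 for "started" are assumptions for this sketch):

using System.Data.SqlClient;

using (SqlConnection conn = new SqlConnection(metaConnectionString)) // CS_meta
{
    conn.Open();
    SqlCommand cmd = new SqlCommand(
        "INSERT INTO dbo.log_file_insert (file_type, file_name, date_process, status) " +
        "OUTPUT INSERTED.id " +
        "VALUES (@file_type, @file_name, SYSDATETIME(), 0);", conn);
    cmd.Parameters.AddWithValue("@file_type", fileType); // FILE_TYPE parameter
    cmd.Parameters.AddWithValue("@file_name", fileName); // current file name
    // Store the new row id so later steps can update this log entry.
    Dts.Variables["User::LOG_ID"].Value = (int)cmd.ExecuteScalar();
}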
2) Insert the data of the CSV file into the table
This part is where most of the work resides.
First, the task looks at the structure of the file and uses the header to do the matching with the destination table (parameter TABLE_DESTINATION).
Secondly, it creates a rollback object and stores it in an SSIS package variable. The reason for handling the rollback this way is to be able to roll back from C# code after any event. Because the rollback object is stored in a variable, it can be accessed from anywhere in the package, which gives far more flexibility than the traditional way of starting a flow with a BEGIN TRANSACTION in a SQL task.
Lastly, it inserts the data into the table.
In case of error, an event handler attached to this task logs the error.
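Here is a minimal sketch of how that transaction can be opened and shared through a package variable (User::transaction_object is the variable name used by the rollback task below; destinationConnectionString, standing in for the CS_destination parameter, is an assumption):

using System.Data.SqlClient;

// Open the connection, start the transaction, and park it in an
// Object-typed package variable so any later task can pick it up.
SqlConnection conn = new SqlConnection(destinationConnectionString);
conn.Open();
SqlTransaction tran = conn.BeginTransaction();
Dts.Variables["User::transaction_object"].Value = tran;
Dts.TaskResult = (int)ScriptResults.Success;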
3) In case of success, move the file to archive
The link between extract_and_load and the Move to archive folder step is conditional on the success of extract_and_load. This step is a File System Task that moves the current file to the Archive folder defined by the ARCHIVE_FILE_FOLDER parameter. If the Move file to archive step succeeds, the flow goes to the logging step; if it fails, it goes to the Rollback step (detailed in 3 bis).
3 bis) In case of failure, rollback
If the extract_and_load step fails, any transaction made against the database (except the one for logging) is rolled back.
It is an SSIS Script Task in C# with the following code:
// Read the SSIS variable holding the open transaction and roll it back.
SqlTransaction tran = (SqlTransaction)Dts.Variables["User::transaction_object"].Value;
tran.Rollback();
Dts.TaskResult = (int)ScriptResults.Success;
It reads the SSIS variable that holds the transaction and rolls that transaction back.
4) Commit the transaction in case of success
If both the load and the move of the file to the archive folder were successful, the transaction is committed.
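A commit script task can mirror the rollback one. A minimal sketch:

using System.Data.SqlClient;

// Retrieve the shared transaction, commit it, and release its connection.
SqlTransaction tran = (SqlTransaction)Dts.Variables["User::transaction_object"].Value;
SqlConnection conn = tran.Connection; // capture it before Commit() clears it
tran.Commit();
conn.Close();
Dts.TaskResult = (int)ScriptResults.Success;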
5) Case of failure
If the Rollback step has been called, the next step logs a failure of the load for this file and moves on to the next one.
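Sketched the same way, the failure logging can be an update of the row created at the start of the process, located through the LOG_ID variable (the status code 2 for failure is an assumption, as is metaConnectionString standing in for CS_meta):

using System.Data.SqlClient;

// Mark the log row written at the start of the process as failed.
using (SqlConnection conn = new SqlConnection(metaConnectionString))
{
    conn.Open();
    SqlCommand cmd = new SqlCommand(
        "UPDATE dbo.log_file_insert SET status = 2 WHERE id = @log_id;", conn);
    cmd.Parameters.AddWithValue("@log_id", Dts.Variables["User::LOG_ID"].Value);
    cmd.ExecuteNonQuery();
}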
Things to remember
There are a couple of interesting things to highlight. First, there is one transaction per file, so if a file fails for any reason it does not affect the other ones.
The main idea is to be able to run the package twice without affecting the ETL, so that a file is not inserted twice on the next load if a system error blocks the file transfer (the file, being still in the import folder, would be picked up on the next load). To achieve that, if the file cannot be moved, the transaction is rolled back.
4) How to deploy
The first step is to deploy a CSV library on the server (and on any workstation that uses the package) in a folder called Library at the root of the C drive, so you should have C:\Library containing two files: CsvHelper.dll and CsvHelper.xml. This library is the one that reads the CSV files.
On machines running Windows 7 or later, the CsvHelper.dll file is flagged as coming from the internet, which causes issues later on. To fix it, the flag needs to be removed: go to the file properties and unblock the file.
For the next phase you will need an existing SSIS catalogue; if you don't have one, create it from the Integration Services Catalogs node in SSMS.
The next phase is to deploy the project using Visual Studio. First, change the security option to "Do not save sensitive data".
Right-click the project, then Properties -> Deployment -> Server name, and type the name of the server you want to deploy the project to.
Right-click the project and hit "Deploy". If you don't have one yet, you will have to create a folder under the SSIS catalogue to hold your project.
Now you can find your package in the SSIS catalogue tree.
The next step is to create an SSIS environment (one for each file type) and map each variable inside the SSIS environment to the package parameters. Later on, in the SSIS job, you will be able to choose which SSIS environment to use for that specific job.
Then we will create the variables for this environment.
Ten variables need to be created, all with the String type, listed below:
- ARCHIVE_FILE_FOLDER Path of the folder holding the archived files
- CS_destination Connection string of the destination database
- CS_meta Connection string of the meta-database that stores the logging information
- FILE_PATTERN File pattern (for example, Customer_*.csv will load all the files starting with Customer followed by any characters and ending with .csv)
- FILE_TYPE File type, used for logging; it tells what kind of data is stored in this set of files, for example Customer
- SCHEMA_DESTINATION Database schema of the destination table
- SOURCE_FILE_FOLDER Path of the folder holding the files to import
- TABLE_DESTINATION Name of the destination table
- TEXT_QUALIFIER Text qualifier used in the CSV (generally ")
- TEXT_SEPARATOR Text separator used in the CSV (generally ,)
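The last two parameters end up in the CsvHelper configuration. As an illustration (CsvHelper's API has changed between versions, so this sketch follows the CsvConfiguration style of recent releases; the local variables are hypothetical stand-ins for the parameters):

using System.Globalization;
using System.IO;
using CsvHelper;
using CsvHelper.Configuration;

string filePath = @"C:\Import\Customer_20170131.csv"; // current file
string textSeparator = ",";   // TEXT_SEPARATOR
string textQualifier = "\"";  // TEXT_QUALIFIER

var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    Delimiter = textSeparator,
    Quote = textQualifier[0],
};
using (var reader = new StreamReader(filePath))
using (var csv = new CsvReader(reader, config))
{
    csv.Read();
    csv.ReadHeader(); // the header row drives the matching with TABLE_DESTINATION
    while (csv.Read())
    {
        string firstField = csv.GetField(0); // fields are then loaded into the table
    }
}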
Now we will map the parameters of the package to the SSIS environment variables.
Each parameter of the package needs to be mapped to a variable of the SSIS environment.
So right-click the package -> Configure -> References (you can have multiple environments for the same package).
Add the SSIS environment to the package.
Then go to Parameters and, on each parameter, click on the three dots.
Select the correct mapping variable.
Do this for all the parameters of the package.
You can do the same with the Connections; if these change between file types, it is worth parameterizing them.
Now create a new job. Create a step with the package: select SSIS package in the list of step types, select your package, then go to Configuration and select the corresponding SSIS environment for this file load.
The last step is to create the log table using this script:
CREATE TABLE [dbo].[log_file_insert](
[id] [int] IDENTITY(1,1) NOT NULL,
[file_type] [nvarchar](500) NULL,
[file_name] [nvarchar](4000) NULL,
[date_process] [datetime2](7) NULL,
[status] [int] NULL,
CONSTRAINT [PK_log_file_insert] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO