Incremental loading is not a easy topic. I am just concentrating on a very narrow focus here.
Here is the context of this blog.
- Data Source: relational database in DB2
- Data Destination: a table in SQL database
- Reporting: a SSRS report with direct data pull from the above table
- ETL: Implemented in a SSIS package with some business rules built-in
- Daily Process: newly added data from the data source needs to be loaded into the destination table
Challenge of the ETL Process:
- Need to be re-startable: without any manual setting
- Need to be fast: the source can contain large number of records
- Need to be self-correctable: if source data was corrected and back dated, the destination data should be self-correctable
What it really means is to choose among the following options regarding the incremental daily loading:
- truncate and re-populate the destination table daily
- start from the last date from the destination table: I will need to pass the last date from my SQL table. I can either read it in as a user variable in my SSIS package, and pass it to my query, or I can just “ETL” the last date to the source DB2 database.
- just pull the previous day’s data from the source
- pull the previous 3 days’ data from the source
The first choice is simple enough, but will suffer from poor performance.
The second choice can be a little messy, but it sounds like a good choice.
The third choice is simple enough, but not a good one when considering that our SQL job can potentially fail every day, in which case, data will be missing due to job failure.
The last choice is simple, and it will pull data for the last 3 days even when your job failed in the last 2 days. And it also should be fast. Fortunately SQL jobs are fixed Monday to Friday in our environment.
For the sake of simplicity, I picked the last option for some of my ETL processes. So far, the 3 day rule has worked pretty smoothly.