The most common task in an ETL process is to determine what to update, what to delete and what to insert using criteria that are specific to what you are doing.
One of the routine tasks in the asset management master database ETL project is to use a composite key of host name + serial number to determine if an asset needs to be updated from a network auto-discovery tool, or needs to be created as a new asset. No assets will be deleted though. They can be de-commissioned in variety of ways if they meet certain criteria.
One QA technique I use routinely is based on an ETL principal, that is “ETL process should not create new data”, or what I called “no more, no less”.
Suppose I have the following 5 milestone points and staging.
- Extracting raw data
- Cleansing and standardizing and integrating
- Conforming data to the master database
- Delivering to the target
- Process reporting
If I start from the stage 1 and get these row counts from the delivering stage:
- Insert: 2,000
- Update: 300,000
Now if I re-start from stage 2 (skip stage 1), my process should not create any new data, since my source data remains the same. I’d expect my row counts look like this:
- Insert: 0
- Update: 302,000 (300,000 + 2,000)
This simple QA technique helped me tremendously in
1) debugging my own process
2) also discovering new patterns in the data.