Getting Started with Adobe After Effects - Part 6: Motion Blur
First Time? You can support us by signing up. It takes only 5 seconds. Click here to sign up. If you already have an account, click here to login.
Loading

1st Prize - Apple iPad


SQLServer Quiz 2011 - Use of Foreign Key constraints

  • Should Foreign Key constraints be used in a Data Warehouse?

    Posted on 06-07-2011 00:00 |
    Mike Lewis
    42 · 4% · 1336

28  Answers  

Subscribe to Notifications
Previous 1 | 2 | 3 Next
  • Score
    1

    No

    Replied on Jun 13 2011 12:43AM  . 
    Kannan KR
    679 · 0% · 48
  • Score
    7

    It depends - if your warehouse is small and you do not mind the performance hit when inserting / updating via your ETL, then you can use them. Data integrity is always the most important part of a system. Or, if your system is larger and you plan to only insert / update via your ETL solution you can drop or disable them, load, then re-enable them. Or, you can have them but keep them disabled.

    I prefer to have the relationships defined but disable them so 1) it is self documenting [the next guy can easily tell what relates to what], 2) tools can read the relationships [Analysis Services for building relationships in your DSV, or Visio for diagramming], and 3) you can easily turn them back on to do a sanity check when necessary.

    end answer : always have them defined, it is just whether you want to enable them and can handle the performance hit.

    Replied on Jun 13 2011 6:48AM  . 
    JohnHennesey
    2269 · 0% · 5
  • Score
    2

    yes

    Replied on Jun 14 2011 3:59AM  . 
    Alpesh Patel
    36 · 5% · 1663
  • Score
    1

    NO. Foreign Key constraints should not be used in a Data Warehouse.

    Replied on Jun 23 2011 5:14AM  . 
    Dinesh
    636 · 0% · 53
  • Score
    8

    The question about implementing Foreign Keys in a data warehouse or data mart is misleading.

    The real question, implied by the use of Foreign Keys, is: Who should be responsible for enforcing relational integrity in a data warehouse or data mart: the ETL process or the database engine?

    Let's use the following example: You want to load 100,000 rows. Of these 100,000 rows, 10 rows contain data without valid foreign keys.

    I suggest that it should be the responsibility of the ETL process. This allows orphaned child records to be handled without breaking the load process. If you are loading 100,000 rows, 10 of which are orphans, 999,990 rows will be successfully loaded. The 10 offensive rows can be forwarded to an error table.

    If foreign key constraints are implemented at the database level and an attempt is made to insert an orphan, the batch fails and is rolled back. That means none of the rows, even the rows which contain valid data, will be loaded. Another option is to load only those rows with valid foreign key values, which requires a lookup prior to loading the batch.

    Either way, you've likely just blown your window.

    I do not suggest dispensing with relational integrity concerns with data marts or data warehouses. But we must acknowledge that source systems are going to have data quality issues and you should be prepared to deal with them. I believe the appropriate place to deal with these issues are the ETL processes that load the data warehouses or data marts.

    Replied on Jun 24 2011 5:36PM  . 
    Marc Jellinek
    97 · 2% · 545
  • Score
    0

    "I suggest that it should be the responsibility of the ETL process. This allows orphaned child records to be handled without breaking the load process. If you are loading 100,000 rows, 10 of which are orphans, 999,990 rows will be successfully loaded. The 10 offensive rows can be forwarded to an error table."

    Just wondering - after a successful load how do you continue to ensure data integrity, or can it be guaranteed that data integrity will not be broken by DBAs, developers, or additional processes once the load is complete?

    Replied on Jun 24 2011 9:41PM  . 
    Michael Coles
    156 · 1% · 302
  • Score
    0

    @Michael

    "how do you continue to ensure data integrity"

    A well governed set of ETL processes. Granted, this is more work than just setting FK constraints.

    "can it be guaranteed that data integrity will not be broken by DBAs, developers or additional processes"

    You can hedge your bets on this through rights assignments. No one, other than the security context used by the ETL processes should have INSERT rights onto the tables that make up a data warehouse or data mart. Prior to granting INSERT rights to the tables, the developers, DBAs and processes should be educated and vetted as to their responsibilities (one of which would be to validate relational integrity prior to loading data, and forwarding offensive rows to error table(s)).

    There are a discrete number of things that a data warehouse or data mart should do well. These are:

    1. Accept data as quickly as possible.
    2. Make data available as quickly as possible.
    3. Make sure the data is consistent.

    Foreign keys are a block to #1 and do not add value to #2. FKs do add value to #3 but at the cost of #1.

    Replied on Jun 25 2011 9:48AM  . 
    Marc Jellinek
    97 · 2% · 545
  • Score
    0

    @Marc:

    To summarize:

    1. You do not consider ongoing data integrity to be a concern ("load it and forget it")
    2. You are willing to bet the farm that anyone with appropriate rights to perform DML against your tables will "do the right thing" and manually check data integrity after every DML change; and that programmers will duplicate data integrity validations in all of their code. I'll assume -- rightly or wrongly -- that you are considering permissions to DELETE and UPDATE, despite focusing solely on INSERT. Please advise if I'm wrong and you do not care about the effects of DELETE and UPDATE on data integrity.
    3. You feel the primary function of a data warehouse or data mart is fast load speed. You think the secondary function if a data warehouse or data mart is fast query speed. Data consistency is last on your list.

    Some might argue the opposite - that "correct results" are the singular most important business function of any data warehouse. A cynical man might say that you can generate fast results with a random number generator, but "correct results" require a little bit more time and effort.

    One might continue to argue that how long it takes your ETL process to load data is classified as a "nonfunctional requirement". When the CEO complains that the reports in the reporting dashboard are completely wrong, it's tough to argue the position that you can't permanently fix it or ensure it won't happen again because it will bump the ETL load time from 30 minutes to 1 hour.

    I'll have to admit that I'm guessing that your statement "Make data available as quickly as possible" is not simply a repetitive restatement of "Accept data as quickly as possible". I'm also guessing that you're referencing query speed in this statement. Obviously this might be a severe oversimplification since wall clock speed is an extremely simple metric to measure your code by. Particularly if you have multiple processes running on the same server, since it doesn't account for factors like server resource usage, etc. So under the assumption that you only are concerned with query time wall-clock speed, foreign keys in the database can help. SQL Server can use the additional information provided by foreign key constraints to optimize your queries. In the worst case scenario foreign key constraints should not have an adverse effect on the speed of your queries.

    Replied on Jun 25 2011 1:29PM  . 
    Michael Coles
    156 · 1% · 302
  • Score
    0

    @Michael

    You are willing to bet the farm that anyone with appropriate rights to perform DML against your tables will "do the right thing" and manually check data integrity after every DML change; and that programmers will duplicate data integrity validations in all of their code. I'll assume -- rightly or wrongly -- that you are considering permissions to DELETE and UPDATE, despite focusing solely on INSERT. Please advise if I'm wrong and you do not care about the effects of DELETE and UPDATE on data integrity.

    Anyone will appropriate rights to perform DML changes against a data warehouse should be following proper change manage procedures and implement approved development standards. Development standards and a change management process are critical to the success of a data warehouse project.

    As for UPDATE and DELETE, data warehouses are additive only. Only in the case of using SCD Type 1 should there be updates, but never deletes. Using SCD Type 2 or 4 (Slowly Changing Dimensions), a data warehouse should only have INSERTs. I'll rarely use SCD Type 1 because you lose all history of changes and this tends to defeat the purpose of a data warehouse.

    You feel the primary function of a data warehouse or data mart is fast load speed. You think the secondary function if a data warehouse or data mart is fast query speed. Data consistency is last on your list.

    Please do not use my numbering to imply primary, secondary and tertiary requirements. They are all important. When the fulfillment of one requirement impedes the realization of another requirement, another option must be considered. The use of FK constraints to enforce relational integrity of a data warehouse may impede the realization of loading data within the allowed window. If there is another way to ensure relational consistency that allows for the loading of data within the window, that option should be explored.

    One might continue to argue that how long it takes your ETL process to load data is classified as a "nonfunctional requirement". When the CEO complains that the reports in the reporting dashboard are completely wrong, it's tough to argue the position that you can't permanently fix it or ensure it won't happen again because it will bump the ETL load time from 30 minutes to 1 hour.

    If the difference in load times between using FKs and not using FKs is the difference between a 30 minute load and a 60 minute load, then I'd see the difference as begin trivial. But I usually deal with much larger sets of data. The rules under which I usually have to work are: if the ETL process does not fit within the proscribed window, the ETL process is to abort and roll back. This yields out-of-date data, which in some cases is as bad, if not worse, than incorrect data.

    Please note that I do not advocate loading incorrect data, I simply advocate using the ETL process to ensure relational integrity rather than using FK constraints.

    I'll have to admit that I'm guessing that your statement "Make data available as quickly as possible" is not simply a repetitive restatement of "Accept data as quickly as possible".

    I'm differentiating loading data from querying data.

    I've read recommendations that data within a data warehouse should not be indexed. This would conflict with the need to make data available as quickly as possible, implying that the data be indexed. Prior to ETL loads, I'll drop indexes on target tables, load the data, then re-apply the indexes. This allows the data to be loaded as quickly as possible (because each commit will not update indexes), and be available as quickly as possible (because the data is indexed after it is loaded).

    These recommendations are generic and apply to data that are being loaded in bulk during an ETL window.

    As usual... "IT DEPENDS". For a counter-example, I've done realtime data warehouses that required data be available for querying while being trickle-loaded. Because data was not loaded in volume (maybe a few hundred rows per minute) and needed to be constantly available, it did not make sense to drop and re-create indexes.

    So under the assumption that you only are concerned with query time wall-clock speed, foreign keys in the database can help. SQL Server can use the additional information provided by foreign key constraints to optimize your queries. In the worst case scenario foreign key constraints should not have an adverse effect on the speed of your queries.

    I've heard this in the past, but have never been able to confirm it. Please see the example below:

    Without Keys:

    IF EXISTS (SELECT * FROM sys.tables WHERE name = 'child') DROP TABLE child
    IF EXISTS (SELECT * FROM sys.tables WHERE name = 'parent') DROP TABLE parent
    
    CREATE TABLE parent 
        (
        	id int, 
        	name varchar(255)
        )
    GO
    
    DECLARE @id int = 0
    
    WHILE @id < 1000000
        BEGIN
        	INSERT INTO parent(id, name) VALUES (@id, 'parent ' + CAST(@id as varchar(255)))
        	SET @id += 1
        END
    GO
    
    CREATE TABLE child 
        (
        	id int, 
        	parentid int, 
        	name varchar(255)
        )
    GO
    
    DECLARE @parentid   int = 0
    DECLARE @childid    int = 0 
    
    WHILE @parentid < 1000000
        BEGIN
        	SET @childid = 0 
        	WHILE @childid < 10
        		BEGIN
        			INSERT INTO child(id, parentid, name) VALUES (@childid, @parentid, 'child (' + CAST(@childid as varchar(255)) + ') of parent (' + CAST(@parentid as varchar(255)) + ')')
        			SET @childid += 1
        		END
        	SET @parentid += 1
        END
    
    CREATE UNIQUE CLUSTERED INDEX idxparent ON parent(id)
    GO
    
    CREATE UNIQUE CLUSTERED INDEX idxchild ON child(parentid, id)
    GO
    

    The following query generates a query plan that shows a Clustered Index Scan against parent.idxparent, a Clustered Index Scan against child.idxchild and a merge join between the two.

    SELECT
        p.id, 
        p.name, 
        c.id, 
        c.name
    FROM
        parent p 
        	INNER JOIN child c ON p.id = c.parentid
    

    Now let's see what happens when we use PK and FK constraints:

    IF EXISTS (SELECT * FROM sys.tables WHERE name = 'ChildKeys') DROP TABLE ChildKeys
    IF EXISTS (SELECT * FROM sys.tables WHERE name = 'ParentKeys') DROP TABLE ParentKeys
    
    
    CREATE TABLE ParentKeys
        (
        	id int, 
        	name varchar(255), 
        	CONSTRAINT PKParentKeys PRIMARY KEY (id)
        )
    GO
    
    CREATE TABLE ChildKeys
        (
        	id int, 
        	parentid int, 
        	name varchar(255), 
        	CONSTRAINT PKChildKeys PRIMARY KEY (parentid, id), 
        	CONSTRAINT FKChildKeys FOREIGN KEY (parentid) REFERENCES ParentKeys(id)
        )
    GO
    
    INSERT INTO ParentKeys (id, name)
    SELECT id, name FROM parent
    GO
    
    INSERT INTO ChildKeys (id, parentid, name)
    SELECT id, parentid, name FROM child
    GO
    

    The following query calculates the exact same query plan: two clustered index Scans and a merge join:

    SELECT
        p.id, 
        p.name, 
        c.id, 
        c.name
    FROM
        ParentKeys p 
        	INNER JOIN ChildKeys c ON p.id = c.parentid
    
    Replied on Jun 26 2011 11:46AM  . 
    Marc Jellinek
    97 · 2% · 545
  • Score
    0

    "Anyone will appropriate rights to perform DML changes against a data warehouse should be following proper change manage procedures and implement approved development standards. Development standards and a change management process are critical to the success of a data warehouse project."

    If manually running a bunch of extra queries to duplicate the functionality of FK relationships on a one-off basis is your development standard, then we can bet the farm that everyone is doing that. We can also bet that no one will make any mistakes. Ever.

    "As for UPDATE and DELETE, data warehouses are additive only. Only in the case of using SCD Type 1 should there be updates, but never deletes. Using SCD Type 2 or 4 (Slowly Changing Dimensions), a data warehouse should only have INSERTs. I'll rarely use SCD Type 1 because you lose all history of changes and this tends to defeat the purpose of a data warehouse."

    I'm interested to know how you restrict sysadmins from issuing UPDATE and DELETE statements against tables.

    Thank goodness you don't have any metadata/process control tables in that database that occasionally require maintenance in the form of updates and deletions.

    "Please do not use my numbering to imply primary, secondary and tertiary requirements."

    Sorry about that, I often use numbering when I want to explicitly provide ordering or imply precedence.

    "They are all important. When the fulfillment of one requirement impedes the realization of another requirement, another option must be considered."

    There's an old saying that goes something like "When everything is a priority, nothing is a priority." Because back-end development often involves a series of trade-offs, I find it useful to prioritize the functional and nonfunctional requirements. But to each his own.

    "The use of FK constraints to enforce relational integrity of a data warehouse may impede the realization of loading data within the allowed window. If there is another way to ensure relational consistency that allows for the loading of data within the window, that option should be explored."

    So as I said, you value speed of loading any data over the ongoing integrity of good data. This is fine if it's what you want to do, it's not for me.

    "I'm differentiating loading data from querying data."

    I've read recommendations that data within a data warehouse should not be indexed. This would conflict with the need to make data available as quickly as possible, implying that the data be indexed. Prior to ETL loads, I'll drop indexes on target tables, load the data, then re-apply the indexes. This allows the data to be loaded as quickly as possible (because each commit will not update indexes), and be available as quickly as possible (because the data is indexed after it is loaded).

    These recommendations are generic and apply to data that are being loaded in bulk during an ETL window.

    As usual... "IT DEPENDS". For a counter-example, I've done realtime data warehouses that required data be available for querying while being trickle-loaded. Because data was not loaded in volume (maybe a few hundred rows per minute) and needed to be constantly available, it did not make sense to drop and re-create indexes.

    I've loaded extremely large data warehouses and data marts in the past, and have found that dropping and recreating indexes does help performance in most cases. But again, my singular primary goal when I design a system is to meet the functional requirements first: such as delivering correct results.

    "I've heard this in the past, but have never been able to confirm it. Please see the example below:"

    Your example seems to demonstrate one of the cases I mentioned where FK constraints did not degrade performance. For examples where it can help you can Google "star join optimization", for instance. The Star Join Optimization can use explicit FK constraints to optimize your queries by eliminating unmatched Fact-Dimension join rows early in the process. To be fair FK constraints aren't always necessary to get Star Join Optimization, but MS recommends it so the optimizer doesn't have to rely on query heuristics to determine whether a Star Join Optimization can be applied.

    Even without the Star Join Optimization, though, a trusted FK can provide additional information to the optimizer. Here's a good discussion of how a trusted vs. nontrusted FK constraint can make a big difference, for example: [http://michaeljswart.com/2008/06/keep-your-constraints-trusted/]http://michaeljswart.com/2008/06/keep-your-constraints-trusted/[1]

    So now you should be able to confirm it.

    Replied on Jun 26 2011 12:31PM  . 
    Michael Coles
    156 · 1% · 302
Previous 1 | 2 | 3 Next

Your Answer


Sign Up or Login to post an answer.
Please note that your answer will not be considered as part of the contest because the competition is over. You can still post an answer to share your knowledge with the community.

Copyright © Rivera Informatic Private Ltd.