You are willing to bet the farm that
anyone with appropriate rights to
perform DML against your tables will
"do the right thing" and manually
check data integrity after every DML
change; and that programmers will
duplicate data integrity validations
in all of their code. I'll assume --
rightly or wrongly -- that you are
considering permissions to DELETE and
UPDATE, despite focusing solely on
INSERT. Please advise if I'm wrong and
you do not care about the effects of
DELETE and UPDATE on data integrity.
Anyone with appropriate rights to perform DML changes against a data warehouse should be following proper change management procedures and implementing approved development standards. Development standards and a change management process are critical to the success of a data warehouse project.
As for UPDATE and DELETE, data warehouses are additive only. Only when using SCD Type 1 (Slowly Changing Dimensions) should there be UPDATEs, and there should never be DELETEs. Using SCD Type 2 or Type 4, a data warehouse should only ever have INSERTs. I rarely use SCD Type 1 because you lose all history of changes, which tends to defeat the purpose of a data warehouse.
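To make the insert-only point concrete, here is a minimal sketch of an SCD Type 2 dimension (the table and column names are illustrative, not from any particular warehouse): every change becomes a new row, and the current version of a member is simply the row with the latest effective date, so no UPDATE is ever required.

```sql
-- Hypothetical Type 2 dimension: history is preserved by adding rows,
-- never by UPDATE or DELETE
CREATE TABLE DimCustomer (
    CustomerKey   int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerID    int NOT NULL,           -- business key from the source system
    City          varchar(100) NOT NULL,
    EffectiveFrom datetime NOT NULL
)

-- Customer 42 moves to a new city: INSERT a new version, leave the old row alone
INSERT INTO DimCustomer (CustomerID, City, EffectiveFrom)
VALUES (42, 'Chicago', '2012-01-15')

-- The current version is the row with the latest EffectiveFrom per CustomerID
SELECT d.*
FROM DimCustomer d
WHERE d.EffectiveFrom = (SELECT MAX(d2.EffectiveFrom)
                         FROM DimCustomer d2
                         WHERE d2.CustomerID = d.CustomerID)
```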
You feel the primary function of a
data warehouse or data mart is fast
load speed. You think the secondary
function of a data warehouse or data
mart is fast query speed. Data
consistency is last on your list.
Please do not use my numbering to imply primary, secondary and tertiary requirements. They are all important. When the fulfillment of one requirement impedes the realization of another requirement, another option must be considered. The use of FK constraints to enforce relational integrity of a data warehouse may impede the realization of loading data within the allowed window. If there is another way to ensure relational consistency that allows for the loading of data within the window, that option should be explored.
One might continue to argue that how
long it takes your ETL process to load
data is classified as a "nonfunctional
requirement". When the CEO complains
that the reports in the reporting
dashboard are completely wrong, it's
tough to argue the position that you
can't permanently fix it or ensure it
won't happen again because it will
bump the ETL load time from 30 minutes
to 1 hour.
If the difference in load times between using FKs and not using FKs is the difference between a 30 minute load and a 60 minute load, then I'd see the difference as being trivial. But I usually deal with much larger sets of data. The rules under which I usually have to work are: if the ETL process does not fit within the prescribed window, the ETL process is to abort and roll back. This yields out-of-date data, which in some cases is as bad as, if not worse than, incorrect data.
Please note that I do not advocate loading incorrect data, I simply advocate using the ETL process to ensure relational integrity rather than using FK constraints.
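As a sketch of what I mean (the staging and target table names here are hypothetical), the ETL process can enforce the same parent/child rule a FK would, by validating the staging data in one set-based pass before it ever touches the warehouse tables:

```sql
-- Quarantine staged child rows whose parent does not exist in the warehouse
SELECT s.id, s.parentid, s.name
INTO stage_child_rejects
FROM stage_child s
LEFT JOIN dim_parent p ON p.id = s.parentid
WHERE p.id IS NULL

-- Load only the rows that passed the check; no constraint fires per row
INSERT INTO fact_child (id, parentid, name)
SELECT s.id, s.parentid, s.name
FROM stage_child s
INNER JOIN dim_parent p ON p.id = s.parentid
```

The rejected rows can then be reported on and reprocessed, rather than aborting the whole load the way a FK violation would.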
I'll have to admit that I'm guessing
that your statement "Make data
available as quickly as possible" is
not simply a repetitive restatement of
"Accept data as quickly as possible".
I'm differentiating loading data from querying data.
I've read recommendations that data within a data warehouse should not be indexed. This would conflict with the need to make data available as quickly as possible, implying that the data be indexed. Prior to ETL loads, I'll drop indexes on target tables, load the data, then re-apply the indexes. This allows the data to be loaded as quickly as possible (because each commit will not update indexes), and be available as quickly as possible (because the data is indexed after it is loaded).
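Sketched against a generic target table (the names are illustrative), the pattern is:

```sql
-- Drop the index so the bulk load does not have to maintain it row by row
DROP INDEX idxchild ON child

-- ... bulk load into child here (BULK INSERT, bcp, SSIS, etc.) ...

-- Re-create the index once, over the fully loaded data
CREATE UNIQUE CLUSTERED INDEX idxchild ON child(parentid, id)
```

Note that dropping a clustered index leaves the table as a heap for the duration of the load; whether that is acceptable depends on how the load writes its data.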
These recommendations are generic and apply to data that are being loaded in bulk during an ETL window.
As usual... "IT DEPENDS". For a counter-example, I've built real-time data warehouses that required data to be available for querying while being trickle-loaded. Because data was not loaded in volume (maybe a few hundred rows per minute) and needed to be constantly available, it did not make sense to drop and re-create indexes.
So under the assumption that you only
are concerned with query time
wall-clock speed, foreign keys in the
database can help. SQL Server can use
the additional information provided by
foreign key constraints to optimize
your queries. In the worst case
scenario foreign key constraints
should not have an adverse effect on
the speed of your queries.
I've heard this in the past, but have never been able to confirm it. Please see the example below:
IF EXISTS (SELECT * FROM sys.tables WHERE name = 'child') DROP TABLE child
IF EXISTS (SELECT * FROM sys.tables WHERE name = 'parent') DROP TABLE parent

-- column definitions inferred from the INSERT statements below
CREATE TABLE parent (id int NOT NULL, name varchar(255) NOT NULL)

DECLARE @id int = 0
WHILE @id < 1000000
BEGIN
    INSERT INTO parent(id, name) VALUES (@id, 'parent ' + CAST(@id as varchar(255)))
    SET @id += 1
END

CREATE TABLE child (id int NOT NULL, parentid int NOT NULL, name varchar(255) NOT NULL)

DECLARE @parentid int = 0
DECLARE @childid int = 0
WHILE @parentid < 1000000
BEGIN
    SET @childid = 0
    WHILE @childid < 10
    BEGIN
        INSERT INTO child(id, parentid, name) VALUES (@childid, @parentid, 'child (' + CAST(@childid as varchar(255)) + ') of parent (' + CAST(@parentid as varchar(255)) + ')')
        SET @childid += 1
    END
    SET @parentid += 1
END

CREATE UNIQUE CLUSTERED INDEX idxparent ON parent(id)
CREATE UNIQUE CLUSTERED INDEX idxchild ON child(parentid, id)
The following query generates a query plan that shows a Clustered Index Scan against parent.idxparent, a Clustered Index Scan against child.idxchild and a merge join between the two.
SELECT p.name, c.name
FROM parent p
INNER JOIN child c ON p.id = c.parentid
Now let's see what happens when we use PK and FK constraints:
IF EXISTS (SELECT * FROM sys.tables WHERE name = 'ChildKeys') DROP TABLE ChildKeys
IF EXISTS (SELECT * FROM sys.tables WHERE name = 'ParentKeys') DROP TABLE ParentKeys

CREATE TABLE ParentKeys (
    id int NOT NULL,
    name varchar(255) NOT NULL,
    CONSTRAINT PKParentKeys PRIMARY KEY (id)
)

CREATE TABLE ChildKeys (
    id int NOT NULL,
    parentid int NOT NULL,
    name varchar(255) NOT NULL,
    CONSTRAINT PKChildKeys PRIMARY KEY (parentid, id),
    CONSTRAINT FKChildKeys FOREIGN KEY (parentid) REFERENCES ParentKeys(id)
)

INSERT INTO ParentKeys (id, name)
SELECT id, name FROM parent

INSERT INTO ChildKeys (id, parentid, name)
SELECT id, parentid, name FROM child
The following query produces the exact same query plan: two Clustered Index Scans and a merge join:

SELECT p.name, c.name
FROM ParentKeys p
INNER JOIN ChildKeys c ON p.id = c.parentid
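For completeness: the one concrete optimization I've seen attributed to FK constraints is join elimination. When the constraint is trusted (created or re-checked WITH CHECK) and the query references no columns from the parent table, the optimizer is permitted to remove the join to ParentKeys entirely. I have not confirmed the plan on this exact schema, but the query shape that would exercise it is:

```sql
-- No ParentKeys columns are selected; with a trusted FK on parentid the
-- optimizer can skip scanning ParentKeys altogether
SELECT c.name
FROM ChildKeys c
INNER JOIN ParentKeys p ON p.id = c.parentid
```

Note that this is a case where the join itself is redundant, which is rather different from the FK speeding up a query that genuinely needs both tables.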