SQL Server : ETL Stage Index

Question

We have a very large staging table (> 80 GB). We load invoice data from our source system into the staging table, then transform it and load it into the DWH fact tables. Every day we delete the current month and reload it from the source into the staging table. The staging table holds the complete history over time.

Some DW loads need only the current month; others need the current year and the previous year.

What is a better index strategy:

Clustered index on a date column (Fiscal Period)
Primary key with IDENTITY as surrogate key
Clustered index for natural key (some kind of line item e.g. invoice number)

All queries filter on the date column (Fiscal Period), and some also filter on additional columns such as Invoice Type, which have non-clustered indexes. During the ETL we can disable the non-clustered indexes, but not the clustered index.

Which of the three types has the best performance for:

Insert into Stage table
Query the Stage table

Refer to the data loading performance guide for excellent details on how to make this process as fast as possible. Also related: What is the fastest way to insert large numbers of rows? — Hannah Vernon, Apr 06 '16 at 19:43
Presumably you are actually asking about speed when you talk about "what is the better index strategy" - the answer I linked this question to has excellent details on both speed of inserts, and how to look at indexing. If you add details around exactly what you want to accomplish, I will consider re-opening the question. — Hannah Vernon, Apr 06 '16 at 20:12
The linked question describes disabling/rebuilding non-clustered indexes. My question is about a clustered index. I added some details; I hope they describe better what I want to accomplish. — LuckyStrike, Apr 06 '16 at 20:36
For your staging/historical table, do you need more than one month's worth of data to satisfy the DW load? — billinkc, Apr 06 '16 at 20:47
@billinkc In some DW loads we only need the current month. Sometimes Year and previous year. — LuckyStrike, Apr 06 '16 at 21:08

score 1 · Answer 1 · answered Jul 28 '17 at 18:53

If it is possible for you to use partitioning, I would highly recommend it in this situation, since you know you are working with periodic data loads that fall into months or years. With partitioning, you can reserve your indexes for the other columns you may need, such as InvoiceNo, and let partitioning take care of isolating the month or period you are working with.
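A minimal sketch of what monthly partitioning could look like, not a definitive design. All object names (`pfFiscalPeriod`, `psFiscalPeriod`, `dbo.StageInvoice`) and the column list are hypothetical, Fiscal Period is assumed to be stored as an `int` in YYYYMM form, and the partition-level `TRUNCATE` assumes SQL Server 2016 or later. Whether partitioning is available at all depends on your version and edition.

```sql
-- Hypothetical names; FiscalPeriod is assumed to be an int such as 201604.
CREATE PARTITION FUNCTION pfFiscalPeriod (int)
AS RANGE RIGHT FOR VALUES (201601, 201602, 201603, 201604);

-- Everything on one filegroup to keep the sketch short.
CREATE PARTITION SCHEME psFiscalPeriod
AS PARTITION pfFiscalPeriod ALL TO ([PRIMARY]);

CREATE TABLE dbo.StageInvoice
(
    FiscalPeriod int           NOT NULL,
    InvoiceNo    varchar(20)   NOT NULL,
    InvoiceType  char(2)       NOT NULL,
    Amount       decimal(18,2) NOT NULL
) ON psFiscalPeriod (FiscalPeriod);

-- Clustered index aligned with the partition scheme.
CREATE CLUSTERED INDEX cix_StageInvoice
ON dbo.StageInvoice (FiscalPeriod, InvoiceNo)
ON psFiscalPeriod (FiscalPeriod);

-- Find the partition that holds the month to reload.
-- With the boundaries above, 201604 falls in partition 5.
SELECT $PARTITION.pfFiscalPeriod(201604) AS PartitionNumber;

-- Daily reload: empty only that partition instead of running a large DELETE.
TRUNCATE TABLE dbo.StageInvoice WITH (PARTITIONS (5));
```

The point of the design is that the daily "delete the current month" step touches a single partition, while queries that filter on Fiscal Period can skip the partitions they do not need.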

If you are fairly sure that the date is an integral part of your joins or searches, then I would associate a "smartkey" with your loading strategy and place the clustered index on that instead.

If you are truly determined to seek on the date through an index, then I would place a non-clustered index on this field, as you will likely have to scan rows in your use case anyway. Keep in mind that creating a clustered index on the date and then incurring updates or deletes will fragment your table.
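As a rough illustration of the non-clustered option, assuming an existing staging table with the hypothetical name `dbo.StageInvoice` and hypothetical columns `FiscalPeriod` and `InvoiceType`, the index can be disabled before the load and rebuilt afterwards:

```sql
-- Hypothetical table, column, and index names.
CREATE NONCLUSTERED INDEX ix_StageInvoice_FiscalPeriod
ON dbo.StageInvoice (FiscalPeriod, InvoiceType);

-- Before the daily load: stop maintaining the index during the inserts.
ALTER INDEX ix_StageInvoice_FiscalPeriod ON dbo.StageInvoice DISABLE;

-- ... delete the current month and reload it here ...

-- After the load: a rebuild re-enables the index and removes fragmentation.
ALTER INDEX ix_StageInvoice_FiscalPeriod ON dbo.StageInvoice REBUILD;
```

This only works for non-clustered indexes. Disabling a clustered index makes the table's data inaccessible until the index is rebuilt.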

score 0 · Answer 2 · answered Apr 07 '16 at 00:07

For the date key you might prefer a number. For April 7th 2016, 201604 (or 20160407 if you need day-level detail) might be best for your date. I don't think that in a DW you need an IDENTITY column for a primary key; after all, you are aggregating more than asking to see a specific value. What I do is make sure the index prevents duplicate values, for example by using a clustered index that consists of several columns. This way I can identify a specific row and make sure there are no duplicates, while avoiding the IDENTITY column, which could reach a very large number and adds nothing useful.
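A small sketch of both ideas, under stated assumptions: the fiscal period is assumed to follow the calendar month, and the table `dbo.StageInvoiceLine` with its columns is hypothetical.

```sql
-- Derive numeric keys from a date.
DECLARE @d date = '2016-04-07';

SELECT
    YEAR(@d) * 100 + MONTH(@d)              AS FiscalPeriodKey, -- 201604
    CONVERT(int, CONVERT(char(8), @d, 112)) AS DateKey;         -- 20160407

-- Hypothetical staging table without an IDENTITY column.
CREATE TABLE dbo.StageInvoiceLine
(
    FiscalPeriod int           NOT NULL,  -- YYYYMM
    InvoiceNo    varchar(20)   NOT NULL,
    InvoiceLine  int           NOT NULL,
    Amount       decimal(18,2) NOT NULL
);

-- The multi-column unique clustered index identifies each row
-- and rejects duplicate loads of the same invoice line.
CREATE UNIQUE CLUSTERED INDEX ucx_StageInvoiceLine
ON dbo.StageInvoiceLine (FiscalPeriod, InvoiceNo, InvoiceLine);
```

Putting the period first keeps the rows of one month physically together, which matches queries that always filter on Fiscal Period.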
