15

I know this type of question comes up a lot, but I've yet to read any compelling arguments to help me make this decision. Please bear with me!

I have a huge database - it grows by about 10,000,000 records per day. The data is relational, and for performance reasons I load the table with BULK COPY. For this reason, I need to generate keys for the rows, and cannot rely on an IDENTITY column.

A 64-bit integer - a bigint - is wide enough for me to use, but in order to guarantee uniqueness, I need a centralised generator to make my IDs for me. I currently have such a generator service, which lets each client reserve a block of X sequence numbers and guarantees no collisions. However, a consequence of this is that all my services rely on this one centralised generator, so I'm limited in how I can distribute my system, and I'm not happy about the other dependencies (such as requiring network access) imposed by this design. This has been a problem on occasion.

I'm now considering using sequential GUIDs as my primary keys (generated externally to SQL). As far as I've been able to ascertain from my own testing, the only drawback to these is the disk space overhead of a wider data type (which is exacerbated by their use in indexes). I've not witnessed any discernible slowdown in query performance, compared to the bigint alternative. Loading the table with BULK COPY is slightly slower, but not by much. My GUID-based indexes are not becoming fragmented thanks to my sequential GUID implementation.
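
For illustration, here's a rough T-SQL sketch of the sort of layout I'm describing (my actual generator runs in application code; the epoch, resolution and constants here are invented for the example). SQL Server compares uniqueidentifier values with the last six bytes as the most significant, which is why the timestamp goes at the end:

    -- Illustrative sketch only - not the production generator.
    -- SQL Server sorts uniqueidentifier values by the trailing bytes first,
    -- so a timestamp in bytes 10-15 makes the values (mostly) index-friendly.
    DECLARE @random BINARY(8) = CRYPT_GEN_RANDOM(8);   -- uniqueness filler
    DECLARE @seq    BINARY(2) = 0x0001;                -- sequence no. within one timestamp
    DECLARE @ts     BINARY(6) = CONVERT(BINARY(6),     -- seconds since an arbitrary epoch
        DATEDIFF(SECOND, '2000-01-01', GETUTCDATE()));

    SELECT CONVERT(UNIQUEIDENTIFIER, @random + @seq + @ts) AS SequentialGuid;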

Basically, what I want to know is if there are any other considerations I may have overlooked. At the moment, I'm inclined to take the leap and start using GUIDs. I'm by no means a database expert, so I'd really appreciate any guidance.

Barguast
  • 341
  • 2
  • 6
  • 2
    How would you generate a "sequential GUID"? –  Dec 28 '12 at 00:00
  • It's a custom implementation. It's basically a GUID-type format which has 6 bytes replaced with timestamp bytes, and 2 bytes which represent a sequence number for when the timestamp is the same. It isn't guaranteed to produce perfectly sequential values, but it is good enough to make index fragmentation a non-issue for me. –  Dec 28 '12 at 00:12
  • Are you therefore loading this data in from multiple different sources? I'm also assuming that the index you are worried about fragmenting is the clustered index? –  Dec 28 '12 at 00:21
  • Yes, it is the clustered index. However, as I said in my post I'm not concerned about fragmentation. My own testing has shown that there is zero fragmentation when using my sequential GUID implementation in a single-process environment. When loading the data from different sources (which can happen, and I certainly want to support it for when the system needs scaling up) I expect there will be a small degree of fragmentation due to small time differences, but nothing major. –  Dec 28 '12 at 00:26
  • 2
    If you are going with a sequential GUID, you should look at NEWSEQUENTIALID(). It should do what you want (monotonically increasing), and doesn't rely on custom code. –  Dec 28 '12 at 03:26
  • 2
    Look at Jeremiah Peschka's post The trouble with keys. Good read, and he's dealt with these implementations many times over. – billinkc Dec 28 '12 at 05:01
  • Using Sequential guids is not far from using a 128 BIT IDENTITY data type (if that existed). So, why don't you just use a BIGINT identity since the size would suffice for your needs? –  Dec 28 '12 at 09:57
  • I said in my question why I can't. I need the keys before the database gets involved. Edit: same goes for why I can't use NEWSEQUENTIALID(), but that's somewhat irrelevant. I'm happy with my implementation, just concerned about the width of the key. –  Dec 28 '12 at 10:23
  • @billinkc: Thanks for the link, that does seem to be describing the same issue as mine. Interestingly, they came up with a timestamp + sequence number + semi-random 128-bit key, which is essentially what I've done (except they're storing it as a decimal, and I'm using uniqueidentifier). –  Dec 28 '12 at 10:36
  • Related post here on DBA.SE: GUID vs. INT – Nick Chammas Dec 30 '12 at 06:26

5 Answers

4

I'm in a similar situation. Currently, I'm using the sequential GUID approach and have no fragmentation and easy key generation.

I have noticed two disadvantages that caused me to start migrating to bigint:

  1. Space usage. 8 more bytes per row, per index. Multiply that by 10 indexes or so and you get a huge waste of space.
  2. Columnstore indexes do not support GUIDs.

(2) was the killer for me.

I will now generate my keys like this:

yyMMddHH1234567890

I'll be using a leading date plus hour and a sequential part after that. That allows me to range-query my data by date without any additional index at all. This is a nice bonus for me.
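
To make that concrete, a minimal sketch of composing such a key (the names are mine, and FORMAT needs SQL Server 2012 or later):

    -- An 18-digit BIGINT key: yyMMddHH * 10^10 + sequential part.
    -- The date/hour prefix makes the key range-queryable by date, and the
    -- trailing ten digits leave room for 10 billion keys per hour.
    DECLARE @prefix BIGINT = CONVERT(BIGINT, FORMAT(SYSUTCDATETIME(), 'yyMMddHH'));
    DECLARE @seq    BIGINT = 1234567890;   -- e.g. from a HiLo block, see below

    SELECT @prefix * 10000000000 + @seq AS NewKey;   -- e.g. 121228151234567890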

I'll generate the sequential part of the bigint using a HiLo algorithm that lends itself well to being distributed.
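
On the database side, one way to sketch a HiLo reservation (the allocator table and the block size of 32,768 are hypothetical) - each client pays one round-trip to reserve a "hi" block, then generates keys locally until the block runs out:

    -- Hypothetical allocator: a single row holding the next unissued "hi" value.
    CREATE TABLE dbo.HiLoAllocator (NextHi BIGINT NOT NULL);
    INSERT INTO dbo.HiLoAllocator (NextHi) VALUES (0);

    -- Atomically claim one "hi" block; deleted.NextHi is the pre-update value.
    DECLARE @reserved TABLE (Hi BIGINT);

    UPDATE dbo.HiLoAllocator
       SET NextHi = NextHi + 1
    OUTPUT deleted.NextHi INTO @reserved;

    -- With a block size of 32,768 the caller can now hand out keys
    -- Hi * 32768 + lo (lo = 0..32767) without touching the database again.
    SELECT Hi * 32768 AS FirstKey, Hi * 32768 + 32767 AS LastKey
    FROM   @reserved;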

Hope some of this transfers to your situation. I definitely recommend using bigint.

usr
  • 7,330
  • 5
  • 32
  • 58
  • 1
    Marking this as the 'answer', as it's the best fit (and you seem to appreciate what I'm asking and why this isn't as straightforward as it might first appear). I think I'm going to go with a shared sequence generator (which will work similarly to your HiLo algorithm suggestion). I have this working on another system with few problems, I'll just have to put up with the extra dependency. Oh well. Thanks. – Barguast Jan 02 '13 at 14:29
3

With the INT type, starting at 1, you get over 2 billion possible rows - that should be more than sufficient for the vast majority of cases. With BIGINT, you get roughly 9.2 quintillion (9'223'372'036'854'775'807 - a 19-digit number) - enough for you??

If you use an INT IDENTITY starting at 1, and you insert a row every second, you need around 68 years before you hit the 2 billion limit ....

If you use a BIGINT IDENTITY starting at 1, and you insert one thousand rows every second, you need a mind-boggling 292 million years before you hit the 9.2 quintillion limit ....

Using your 10 million rows per day, that'll give you enough numbers for roughly 922'337'203'685 days (922 billion days, or roughly 2.5 billion years) of data - is that good enough for your needs?
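
If you want to sanity-check those figures yourself (T-SQL integer division, so the results are floors):

    -- Back-of-envelope capacity arithmetic:
    SELECT 2147483647 / (60 * 60 * 24 * 365)                 AS IntYearsAtOneRowPerSecond,      -- ~68
           9223372036854775807 / 1000 / (60 * 60 * 24 * 365) AS BigintYearsAt1000RowsPerSecond, -- ~292 million
           9223372036854775807 / 10000000                    AS BigintDaysAt10MillionRowsPerDay;-- ~922 billion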

Read more about it (with all the options there are) in the MSDN Books Online.

marc_s
  • 8,932
  • 6
  • 45
  • 51
  • 1
    The insertion rate of 10 million rows per day would exhaust the INT range in 200 days. – mceda Dec 28 '12 at 09:19
  • @mceda: yes - did I claim anything else? It doesn't exhaust the BIGINT range that quickly, though .... – marc_s Dec 28 '12 at 09:43
  • Thanks, but as I said in my question, I need the IDs before they are sent to the database. The data is relational, so I need to assign primary and foreign keys before it is bulk copied. If it wasn't for that, an IDENTITY BIGINT would probably be perfect. –  Dec 28 '12 at 10:21
  • 2
    @Barguast: couldn't you just bulk-insert your data into a staging table (without the identity) and then move it from there into your actual data tables using BIGINT IDENTITY ? – marc_s Dec 28 '12 at 11:30
  • @marc_s: yes, the calculation provided was not aligned with the question: "If you use an INT IDENTITY starting at 1, and you insert a row every second, you need around 68 years before you hit the 2 billion limit". – mceda Dec 28 '12 at 11:55
  • @mceda: yes that was from a previous response - but I did adapt the BIGINT scenario since that seemed more appropriate here (seeing that 200 days for INT was probably not enough) – marc_s Dec 28 '12 at 11:59
  • @marc_s: Regarding your staging table suggestion, how would this work? I can imagine adding the GUID-keyed rows to the staging table, and then mapping the GUIDs to bigints internally for the primary keys, and then again for the foreign keys in associated tables. Then I'd need to get the bigint keys back out again. I can imagine it working, and may give it a go. However, it seems a bit complex in comparison to the alternative and I'd be worried about the extra overhead involved given the performance requirements imposed by the volume of data. –  Dec 28 '12 at 12:36
2

I recommend you use a SEQUENCE of the BIGINT data type in SQL Server 2012. It is much more flexible than IDENTITY, with options like CACHE/NOCACHE, and you can also reserve a sequence range for a batch operation with sp_sequence_get_range.
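
For example (the sequence name and range size here are illustrative):

    -- A BIGINT sequence; CACHE preallocates values in memory for speed.
    CREATE SEQUENCE dbo.RowKeys AS BIGINT
        START WITH 1
        INCREMENT BY 1
        CACHE 10000;

    -- Reserve a contiguous block of 1,000,000 keys for one bulk-copy batch,
    -- so the client can stamp keys onto rows before the data reaches the server.
    DECLARE @first SQL_VARIANT;

    EXEC sys.sp_sequence_get_range
         @sequence_name     = N'dbo.RowKeys',
         @range_size        = 1000000,
         @range_first_value = @first OUTPUT;

    SELECT CONVERT(BIGINT, @first) AS FirstReservedKey;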

2

Is the reason you can't use IDENTITY that there are already foreign key relationships between the separate tables you are loading, and there is no other natural key to link them up when moving from a staging area to the production area? For that reason, I'd like to know a little more about how they are currently "linked" in the source system before you bulk copy. Do multiple source systems simply use their own sequences, with the possibility of conflicting sequences when brought into a shared database?

The COMB ID/sequential GUID technique is one I am familiar with, and it is workable any time you need global uniqueness assigned outside the database - it gives you a usable row identity both inside and outside the database. For that reason, it's a reasonable choice in highly distributed environments or disconnected scenarios.

Skip it if you don't really need it, though, because that extra width is significant as the data grows: these keys end up in every index and in the working sets of a lot of queries.

Also, with generation distributed, if the rows don't actually arrive in the order of the GUID column, the usual concerns about the clustered index key (which should ideally be narrow, static, and increasing) still apply - you can get some fragmentation compared to clustering on an IDENTITY.

Cade Roux
  • 6,602
  • 1
  • 31
  • 54
0

In general, you can use the OUTPUT clause of the INSERT statement to capture the identity values as data is inserted, so rows in both tables can be related through the identity field.
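
A minimal sketch of that pattern (all table and column names here are hypothetical):

    -- Parent has an IDENTITY key; Staging holds the bulk-loaded rows.
    -- OUTPUT captures each generated Id next to a staging reference so the
    -- child rows can be inserted with the correct foreign keys afterwards.
    DECLARE @NewIds TABLE (Id BIGINT, SourceRef INT);

    INSERT INTO dbo.Parent (SourceRef, Payload)
    OUTPUT inserted.Id, inserted.SourceRef INTO @NewIds (Id, SourceRef)
    SELECT SourceRef, Payload
    FROM   dbo.Staging;

    -- Child rows join back through the captured mapping.
    INSERT INTO dbo.Child (ParentId, Detail)
    SELECT n.Id, s.Detail
    FROM   dbo.StagingChild AS s
    JOIN   @NewIds AS n ON n.SourceRef = s.SourceRef;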

An identifier based on a timestamp should not be considered reliable - it depends on the system clock, which in turn depends on many things, from the hardware clock to time synchronization services.

Serg
  • 141
  • 1
  • 3