We have a table that is going to be say 100 million to a billion rows (Table name: Archive)
This table will be referenced from another table, Users.
We have 2 options for the primary key on the Archive table:
option 1: dataID (bigint)
option 2: userID + datetime (the 4-byte smalldatetime version).
Schema:
Users - userID (int)
Archive - userID - datetime
OR
Archive - dataID (big int)
Which one would be faster?
We are shying away from Option 1 because bigint is 8 bytes, and with 100 million rows that will add up to a lot of storage.
Update: Sorry, I forgot to mention that userID and datetime have to be stored regardless; that was the reason for not wanting to add another column, dataID, to the table.
-
Some thoughts, but there is probably not a clear cut solution:
If you have at most a billion rows, why not use int, which goes from -2.1 billion to +2.1 billion?
userID (int, 4 bytes) + smalldatetime (4 bytes) = 8 bytes, the same as bigint
If you are thinking of userID + smalldatetime, then surely that pair is useful anyway. If so, adding a surrogate "archiveID" column will increase space regardless
Do you require filtering/sorting by userid + smalldatetime?
Make sure your model is correct, worry about JOINs later...
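To put the storage point in numbers, here is some back-of-envelope arithmetic (key bytes only; it ignores per-row overhead and the fact that the clustered key is also copied into every nonclustered index):

```sql
-- Rough key storage for 1 billion rows, in MB (integer division).
SELECT
    CAST(1000000000 AS bigint) * 8 / 1048576       AS bigint_key_MB,    -- ~7629 MB
    CAST(1000000000 AS bigint) * (4 + 4) / 1048576 AS composite_key_MB  -- identical: 4+4 = 8 bytes
```

Either way the key itself costs the same; the real savings in Option 2 come from not storing a third column at all.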
R. Bemrose: As an alternative to the first bullet, how about an unsigned int, which goes from 0 to 4.2 billion?
gbn: SQL Server does not have unsigned int :-)
-
What's with option 3: Making dataID a 4 byte int?
Also, if I understand it right, the archive table will be referenced from the users table, so it wouldn't even make much sense to have the userID in the archive table.
-
I recommend that you set up a simulation to validate this in your environment, but my guess would be that the single bigint would be faster in general; however, when you query the table, what are you going to be querying on?
If I were building an archive, I might lean toward having an autoincrement identity field, and then using a partitioning scheme to partition based on the datetime and perhaps userID, but that would depend on the circumstances.
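A minimal sketch of that partitioning idea, assuming SQL Server 2005+ with monthly ranges; the function, scheme, and constraint names (and the boundary dates) are hypothetical:

```sql
-- Monthly RANGE RIGHT partition function on the smalldatetime column.
CREATE PARTITION FUNCTION pfArchiveByMonth (smalldatetime)
AS RANGE RIGHT FOR VALUES ('2009-01-01', '2009-02-01', '2009-03-01');

CREATE PARTITION SCHEME psArchiveByMonth
AS PARTITION pfArchiveByMonth ALL TO ([PRIMARY]);

-- The partitioning column must be part of the clustered key,
-- so [datetime] is appended to the identity column here.
CREATE TABLE dbo.Archive
(
    dataID     bigint        NOT NULL IDENTITY,
    userID     int           NOT NULL,
    [datetime] smalldatetime NOT NULL,
    CONSTRAINT PK_Archive PRIMARY KEY CLUSTERED (dataID, [datetime])
) ON psArchiveByMonth ([datetime]);
```

Partitioning by date also makes purging or switching out old archive months cheap, which matters more at a billion rows than the key choice itself.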
-
Concern: Using UserID/[small]datetime carries with it a high risk of not being unique.
Here is some real schema. Is this what you're talking about?
-- Users (regardless of Archive choice)
CREATE TABLE dbo.Users (
    userID int NOT NULL IDENTITY,
    <other columns>
    CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID)
)

-- Archive option 1
CREATE TABLE dbo.Archive (
    dataID bigint NOT NULL IDENTITY,
    userID int NOT NULL,
    [datetime] smalldatetime NOT NULL,
    <other columns>
    CONSTRAINT <name> PRIMARY KEY CLUSTERED (dataID)
)

-- Archive option 2
CREATE TABLE dbo.Archive (
    userID int NOT NULL,
    [datetime] smalldatetime NOT NULL,
    <other columns>
    CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID, [datetime] DESC)
)

CREATE NONCLUSTERED INDEX <name> ON dbo.Archive (userID, [datetime] DESC)
If this were my decision, I would definitely go with option 1. Disk is cheap.
If you go with Option 2, it's likely that you will have to add some other column to your PK to make it unique, and then your design starts degrading.
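To illustrate the uniqueness problem: smalldatetime resolves only to the minute, so two archive rows for the same user in the same minute collide, and the PK needs a tiebreaker. A sketch of the degraded key, with hypothetical column and constraint names:

```sql
-- Option 2 once a tiebreaker is forced into the key.
CREATE TABLE dbo.Archive
(
    userID     int           NOT NULL,
    [datetime] smalldatetime NOT NULL,
    seqNo      smallint      NOT NULL,  -- tiebreaker within the same minute
    CONSTRAINT PK_Archive PRIMARY KEY CLUSTERED (userID, [datetime] DESC, seqNo)
)
```

Note the clustered key is now 4 + 4 + 2 = 10 bytes, wider than the 8-byte bigint it was meant to avoid, and that width is repeated in every nonclustered index.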
HLGEM: Plus, I would never use a datetime in a PK unless there was a trigger ensuring it could never be updated after the initial insert. That's just an accident waiting to happen.
bigint: Great point on the uniqueness, Rob!
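A sketch of the kind of guard trigger HLGEM describes, assuming SQL Server; the trigger name is hypothetical:

```sql
-- Reject any UPDATE that touches the [datetime] part of the key.
CREATE TRIGGER trgArchiveNoDateUpdate ON dbo.Archive
AFTER UPDATE
AS
BEGIN
    IF UPDATE([datetime])
    BEGIN
        RAISERROR ('Archive.[datetime] is part of the key and cannot be updated.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END
```

Note that UPDATE() only reports that the column appeared in the SET list; a stricter version would compare inserted against deleted to allow no-op updates through.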