Monday, April 25, 2011

JOIN Performance: Composite key versus BigInt Primary Key

We have a table that is going to hold, say, 100 million to a billion rows (table name: Archive).

This table will be referenced from another table, Users.

We have 2 options for the primary key on the Archive table:

option 1: dataID (bigint)

option 2: userID + datetime (the 4-byte smalldatetime version).

Schema:

Users - userID (int)

Archive - userID, datetime

OR

Archive - dataID (bigint)

Which one would be faster?

We are shying away from Option 1 because bigint is 8 bytes, and with 100 million rows that will add up to a lot of storage.

Update: Sorry, I forgot to mention that userID and datetime have to be stored regardless, which is why we didn't want to add another column, dataID, to the table.
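For concreteness, the two join shapes being compared would look roughly like this (a sketch only; the u.dataID and u.[datetime] reference columns on Users are hypothetical stand-ins for however Users points at Archive):

    -- Option 1: join on the surrogate bigint key
    SELECT a.userID, a.[datetime]
    FROM dbo.Users AS u
    JOIN dbo.Archive AS a
        ON a.dataID = u.dataID;           -- one 8-byte comparison

    -- Option 2: join on the composite (userID, datetime) key
    SELECT a.userID, a.[datetime]
    FROM dbo.Users AS u
    JOIN dbo.Archive AS a
        ON  a.userID     = u.userID       -- 4 bytes
        AND a.[datetime] = u.[datetime];  -- + 4 bytes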

From Stack Overflow
  • Some thoughts, but there is probably not a clear cut solution:

    • If you have a billion rows, why not use int which goes from -2.1 billion to +2.1 billion?

    • Userid (int, 4 bytes) + smalldatetime (4 bytes) = 8 bytes, the same as bigint (see the quick check after this answer)

    • If you are already thinking of userid + smalldatetime, then surely those columns are useful anyway. If so, adding a surrogate "archiveID" column will only increase the space used

    • Do you require filtering/sorting by userid + smalldatetime?

    • Make sure your model is correct, worry about JOINs later...

    R. Bemrose : As an alternative to the first bullet, how about an unsigned int, which goes from 0 to 4.2 billion?
    gbn : SQL Server does not have unsigned int :-)
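
    A quick way to verify the byte math above (a sketch; DATALENGTH reports the storage size of a value in bytes):

    SELECT
        DATALENGTH(CAST(1 AS int))                   AS int_bytes,           -- 4
        DATALENGTH(CAST(1 AS bigint))                AS bigint_bytes,        -- 8
        DATALENGTH(CAST(GETDATE() AS smalldatetime)) AS smalldatetime_bytes; -- 4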
  • What's with option 3: Making dataID a 4 byte int?

    Also, if I understand it right, the archive table will be referenced from the users table, so it wouldn't even make much sense to have the userID in the archive table.

  • I recommend that you set up a simulation to validate this in your environment, but my guess would be that the single bigint would be faster in general; however, when you query the table, what are you going to be querying on?

    If I was building an archive, I might lean toward having an autoincrement identity field and then using a partitioning scheme to partition based on DateTime, and perhaps userid, but that would depend on the circumstances (a sketch of this idea follows).
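
    A minimal sketch of that idea, assuming yearly partitions (the boundary dates, object names, and filegroup choice are illustrative, not from the thread):

    -- Partition function: one partition per year (hypothetical boundaries)
    CREATE PARTITION FUNCTION pfArchiveByDate (smalldatetime)
        AS RANGE RIGHT FOR VALUES ('2009-01-01', '2010-01-01', '2011-01-01');

    -- Map all partitions to PRIMARY to keep the sketch simple
    CREATE PARTITION SCHEME psArchiveByDate
        AS PARTITION pfArchiveByDate ALL TO ([PRIMARY]);

    -- Identity-keyed Archive stored on the date-based partitions.
    -- The partitioning column must be part of any unique index,
    -- so [datetime] is appended to the clustered PK.
    CREATE TABLE dbo.Archive (
        dataID      bigint        IDENTITY(1,1) NOT NULL,
        userID      int           NOT NULL,
        [datetime]  smalldatetime NOT NULL,
        CONSTRAINT PK_Archive PRIMARY KEY CLUSTERED (dataID, [datetime])
    ) ON psArchiveByDate ([datetime]);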

  • Concern: Using UserID/[small]datetime carries with it a high risk of not being unique.

    Here is some real schema. Is this what you're talking about?

    -- Users (regardless of Archive choice)
    CREATE TABLE dbo.Users (
        userID      int           NOT NULL  IDENTITY,
        <other columns>
        CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID)
    )
    
    -- Archive option 1
    CREATE TABLE dbo.Archive (
        dataID      bigint        NOT NULL  IDENTITY,
        userID      int           NOT NULL,
        [datetime]  smalldatetime NOT NULL,
        <other columns>
        CONSTRAINT <name> PRIMARY KEY CLUSTERED (dataID)
    )
    
    -- Archive option 2
    CREATE TABLE dbo.Archive (
        userID      int           NOT NULL,
        [datetime]  smalldatetime NOT NULL,
        <other columns>
        CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID, [datetime] DESC)
    )
    CREATE NONCLUSTERED INDEX <name> ON dbo.Archive (
        userID,
        [datetime] DESC
    )
    

    If this were my decision, I would definitely go with Option 1. Disk is cheap.

    If you go with Option 2, it's likely that you will have to add some other column to your PK to make it unique, and then your design starts degrading.

    HLGEM : Plus, I would never use a datetime in a PK unless there was a trigger ensuring it could never be updated after the initial insert. That's just an accident waiting to happen. (See the trigger sketch at the end of this answer.)
    bigint : great point on the uniqueness Rob!
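
    A minimal sketch of the kind of trigger HLGEM describes (the trigger name and error text are illustrative assumptions):

    -- Reject any UPDATE statement that sets the [datetime] column,
    -- even if the value is unchanged
    CREATE TRIGGER dbo.trgArchive_DatetimeImmutable
    ON dbo.Archive
    AFTER UPDATE
    AS
    BEGIN
        IF UPDATE([datetime])
        BEGIN
            RAISERROR('[datetime] is part of the PK and cannot be updated.', 16, 1);
            ROLLBACK TRANSACTION;
        END
    END;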
