Sequential Guid Algorithm – Improving the algorithm

My last post addressed generating Guid values which were sequential across system boundaries. The problem with other algorithms is two-fold. First, they are only sequential for a single system. Second, they are based off the MAC address of the system on which they were generated. When you need to generate client id values and store them in an SQL database these two properties create fragmented indexes and increase the number of page splits.

My last attempt was OK, but a problem with the algorithm was that the values were only sequential to an accuracy of 1 millisecond. While this would not be a problem for most systems, it still bugged me. Even though I could still generate sequential values at a rate of 1000/sec, real world systems are not so kind as to be so evenly spaced. Values are most likely to be farther apart, but generated in bursts.

I wanted an algorithm that was more accurate and allowed for more values to be generated per second, thus reducing page splits and fragmentation even more.

I took some time to research hi-resolution timers and found that the recommended way of using a timer was the QueryPerformanceCounter Windows API function. But during my research I found that the System.Diagnostics.Stopwatch already uses QueryPerformanceCounter, so I thought I’d give it a try.

In my previous set of tests in order to reduce fragmentation I added a Thread.Sleep(1) to prevent generating values at a rate of more than 1/millisecond. Using Stopwatch, I was able to remove the call to Thread.Sleep and had all the generated values were sequential.

Here’s the new implementation:

public class Comb
{
    private static Stopwatch perfTimer;
    private static long startTime;

    static Comb()
    {
        perfTimer = new Stopwatch();
        startTime = DateTime.UtcNow.Ticks;
        perfTimer.Start();
    }

    private static long GetTicks()
    {
        return startTime + perfTimer.ElapsedTicks;
    }

    /// <summary>
    /// Creates a new sequential guid (aka comb) <see cref="http://www.informit.com/articles/article.aspx?p=25862&seqNum=7"/>.
    /// </summary>
    /// <remarks>A comb provides the benefits of a standard Guid w/o the database performance problems.</remarks>
    /// <returns>A new sequential guid (comb).</returns>
    public static Guid NewComb()
    {
        byte[] uid = Guid.NewGuid().ToByteArray();
        byte[] binDate = BitConverter.GetBytes(GetTicks()); // use UTC now to prevent conflicts w/ date time savings

        // create comb in SQL Server sort order
        byte[] comb = new byte[uid.Length];

        // the first 7 bytes are random - if two combs
        // are generated at the same point in time
        // they are not guaranteed to be sequential.
        // But for every DateTime.Tick there are
        // 72,057,594,037,927,935 unique possibilities so
        // there shouldn't be any collisions
        comb[3] = uid[0];
        comb[2] = uid[1];
        comb[1] = uid[2];
        comb[0] = uid[3];
        comb[5] = uid[4];
        comb[4] = uid[5];
        comb[7] = uid[6];

        // set the first 'nibble of the 7th byte to '1100' so 
        // later we can validate it was generated by us
        comb[6] = (byte)(0xc0 | (0xf & uid[7]));

        // the last 8 bytes are sequential,
        // these will reduce index fragmentation
        // to a degree as long as there are not a large
        // number of Combs generated per millisecond
        comb[9] = binDate[0];
        comb[8] = binDate[1];
        comb[15] = binDate[2];
        comb[14] = binDate[3];
        comb[13] = binDate[4];
        comb[12] = binDate[5];
        comb[11] = binDate[6];
        comb[10] = binDate[7];

        return new Guid(comb);
    }

    /// <summary>
    /// Validates if comb was generated by this class
    /// </summary>
    /// <remarks>
    /// Guids generated by Guid.NewGuid() have a value of 
    /// 0100 for the first 4 bits of the 7th byte. Ours will
    /// have a value of 1100 for the 6th byte. We're checking that here.
    /// 
    /// We could do additional validation by verifying that
    /// the value of a new Guid is greater than the
    /// one being validated (or that the last 6 bytes
    /// resolve to a valid DateTime), but this should
    /// be enough for now.
    /// </remarks>
    public static bool IsComb(Guid value)
    {
        // get the 7th byte
        byte b = value.ToByteArray()[6];

        // make sure the first 'nibble' == 1100
        return (0xc0 & b) == 0xc0;
    }

    /// <summary>
    /// Validates Guid to determine the supplied
    /// value was generated by Comb.NewComb. If
    /// invalid an ArgumentException is thrown.
    /// </summary>
    /// <param name="value"></param>
    public static void ValidateComb(Guid value)
    {
        if (!Comb.IsComb(value))
            throw new ArgumentException("The supplied Id value was not generated by Comb.NewComb.");
    }
}

There is now a static constructor, two private members and a GetTicks() method. Stopwatch is not a DateTime, but a counter representing the elapsed time since the Start() method was called. So I record DateTime.UtcNow.Ticks at “startup” and then add the number of ticks from Stopwatch to the start time.

Running the test in a single thread for 100,000 records for the previous version resulted in 31% fragmentation. The new version is now down to 1.2%. Also, the records are all sequential whereas about half the records in the previous version were not indexed in the same order they were inserted.

However, using Task.Factory.StartNew on a dual core machine the fragmentation is about the same – 45%. Also the number of “mis-ordered” records is closer (60% verses 50% – about a 10% improvement).

The newer version definitely improves on the previous, but if you’re in a high-throughput environment with a large number of clients you’ll see a good amount of fragmentation still. However, the important thing to remember is that compared to standard Guid values (non-sequential) you’ll see much better performance. The reason is because your data distribution will be completely random across your indexes, while using a Comb ensures your data is always being added at the end of your index, even if values generated at the exact same moment will be fragmented. If you can afford to defrag your indexes you’ll be fine – whereas without using a Comb value your performance will still suffer.

Update: 10/29/2010

When I first wrote the NewComb method (before I came across the Sql Server sort order issue) I was using the 8th byte as a way to validate a Guid value and determine if it were generated by the NewComb method or not. During the discussion in the comments below I realized that when I changed the method to change the byte order to use the Sql Server sort order I was setting the 7th byte. This breaks the validation since, as it was pointed out to me, 1 in 16 values generated by Guid.NewGuid() could match my validation. So below is an update to fix that problem. This works because, as I said before, Guid.NewGuid(), sets the first nibble of the 8th byte to 0x40. Obviously this won’t work with Guid values generated outside of the .NET Framework. But if that is your situation, then this won’t work and you’ll need to find another means or just skip the validation altogether.

public static Guid NewComb()
{
    byte[] uid = Guid.NewGuid().ToByteArray();
    byte[] binDate = BitConverter.GetBytes(GetTicks()); // use UTC now to prevent conflicts w/ date time savings

    // create comb in SQL Server sort order
    byte[] comb = new byte[uid.Length];

    // the first 7 bytes are random - if two combs
    // are generated at the same point in time
    // they are not guaranteed to be sequential.
    // But for every DateTime.Tick there are
    // 72,057,594,037,927,935 unique possibilities so
    // there shouldn't be any collisions
    comb[0] = uid[0];
    comb[1] = uid[1];
    comb[2] = uid[2];
    comb[3] = uid[3];
    comb[4] = uid[4];
    comb[5] = uid[5];
    comb[6] = uid[6];

    // set the first 'nibble of the 8th byte to '1100' so 
    // later we can validate it was generated by us
    comb[7] = (byte)(0xc0 | (0xf & uid[7]));

    // the last 8 bytes are sequential,
    // these will reduce index fragmentation
    // to a degree as long as there are not a large
    // number of Combs generated per millisecond
    comb[9] = binDate[0];
    comb[8] = binDate[1];
    comb[15] = binDate[2];
    comb[14] = binDate[3];
    comb[13] = binDate[4];
    comb[12] = binDate[5];
    comb[11] = binDate[6];
    comb[10] = binDate[7];

    return new Guid(comb);
}

public static bool IsComb(Guid value)
{
    // get the 8th byte
    byte b = value.ToByteArray()[7];

    // make sure the first 'nibble' == 1100
    return (0xc0 & b) == 0xc0;
}

Update: 6/9/2011

By eliminating the Stopwatch (as suggested by Augi) and optimizing the steps to copy the arrays to the new Guid I’ve been able to reduce the time it takes to generate a new Comb by about 66%. The old version takes 22-25 milliseconds to generate 10,000 Combs, while the new version takes 6-9 milliseconds on my Phenom X4 with 4GB of RAM. Here’s the modified version:

public class Comb
{
    private static int counter;

    private static long GetTicks()
    {
        int i = Interlocked.Increment(ref counter);
        // use UTC now to prevent conflicts w/ daylight savings
        return DateTime.UtcNow.Ticks + i;
    }

    /// <summary>
    /// Creates a new sequential guid (aka comb) <see cref="http://www.informit.com/articles/article.aspx?p=25862&seqNum=7"/>.
    /// </summary>
    /// <remarks>A comb provides the benefits of a standard Guid w/o the database performance problems.</remarks>
    /// <returns>A new sequential guid (comb).</returns>
    public static Guid NewComb()
    {
        byte[] uid = Guid.NewGuid().ToByteArray();
        byte[] binDate = BitConverter.GetBytes(GetTicks()); 

        // create comb in SQL Server sort order
        return new Guid(new[]{
            // the first 7 bytes are random - if two combs
            // are generated at the same point in time
            // they are not guaranteed to be sequential.
            // But for every DateTime.Tick there are
            // 72,057,594,037,927,935 unique possibilities so
            // there shouldn't be any collisions                
            uid[0],
            uid[1],
            uid[2],
            uid[3],
            uid[4],
            uid[5],
            uid[6],

            // set the first 'nibble of the 7th byte to '1100' so 
            // later we can validate it was generated by us
            (byte)(0xc0 | (0xf & uid[7])),

            // the last 8 bytes are sequential,
            // these will reduce index fragmentation
            // to a degree as long as there are not a large
            // number of Combs generated per millisecond                
            binDate[1],
            binDate[0],
            binDate[7],
            binDate[6],
            binDate[5],
            binDate[4],
            binDate[3],
            binDate[2]
        });
    }

    /// <summary>
    /// Validates if comb was generated by this class
    /// </summary>
    /// <remarks>
    /// Guids generated by Guid.NewGuid() have a value of 
    /// 0100 for the first 4 bits of the 7th byte. Ours will
    /// have a value of 1100 for the 6th byte. We're checking that here.
    /// 
    /// We could do additional validation by verifying that
    /// the value of a new Guid is greater than the
    /// one being validated (or that the last 6 bytes
    /// resolve to a valid DateTime), but this should
    /// be enough for now.
    /// </remarks>
    public static bool IsComb(Guid value)
    {
        // get the 7th byte
        byte b = value.ToByteArray()[7];

        // make sure the first 'nibble' == 1100
        return (0xc0 & b) == 0xc0;
    }

    /// <summary>
    /// Validates Guid to determine the supplied
    /// value was generated by Comb.NewComb. If
    /// invalid an ArgumentException is thrown.
    /// </summary>
    /// <param name="value"></param>
    public static void ValidateComb(Guid value)
    {
        if (!Comb.IsComb(value))
            throw new ArgumentException("The supplied Id value was not generated by Comb.NewComb.");
    }
}

I haven’t tested it, but it’s likely that in a distributed environment (web farm or similar) where there are a large number of value generated w/in the same “tick window” the new version is likely to have different fragmentation statistics. Still it should not be a problem for 99% of the applications out there.