The prime number challenge – great waste of time!

No sane person would even consider using SQL Server to construct a list of prime numbers. So just to prove that I’m not sane (as if there could be any doubt!), this post will be about finding prime numbers.

 

First a bit of history. Ward Pond wrote about efficient ways to populate a table with one million GUIDs. I posted a comment with a slightly more efficient algorithm. And that was quickly followed by a new post from Ward, tweaking my syntax even further. And that’s when Denis the SQL Menace lived true to his name by posting this comment:

 

“How about the next challenge is to return all 78498 prime numbers between 1 and 1000000?”

 

Now of course, this is a silly challenge. Not because prime numbers are silly, mind you. They are very useful to mathematicians, and many encryption algorithms wouldn’t even be possible without (large) prime numbers. The silly part is using SQL Server, a data manipulation tool, to calculate prime numbers. If you really need them, code a quick algorithm in a C++ program. Or buy a ready-made list with the first million or so prime numbers. So I attempted to resist the challenge.

 

Alas – the flesh is weak. So when I saw Ward’s reply to Denis’ challenge, I was no longer able to resist temptation. After all, Ward’s attempt is not only interesting – it is also very long, and apparently (with an estimated completion time of 1 to 2 days!!) not very efficient. I decided that I should be able to outperform that.

 

My assumptions are that a table of numbers is already available, that this table holds at least all numbers from 1 to 1,000,000 (mine holds the numbers from 1 to 5,764,801), and that the challenge is to create and populate a table with the prime numbers from 1 to 1,000,000, in as little time as possible. Displaying the prime numbers is not part of the challenge. For testing purposes, I replaced the upper limit of 1,000,000 with a variable @Limit, and I set this to a mere 10,000. That saves me a lot of idle waiting time!
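
If you don't have such a table of numbers yet, here is a minimal sketch of one way to build one. Note that this is only an illustration of my assumption; the row source (master.sys.all_columns) and the exact TOP value are arbitrary choices, not necessarily how I built my own table:

CREATE TABLE dbo.Numbers
           (Number int NOT NULL PRIMARY KEY)
go
-- Number the rows of a large cross join; the system view is used
-- only as a convenient source of enough rows. 5,764,801 = 7^8.
INSERT INTO dbo.Numbers (Number)
SELECT TOP (5764801)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM   master.sys.all_columns AS ac1
CROSS  JOIN master.sys.all_columns AS ac2
go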

 

As a first attempt, I decided to play dumb. Just use one single set-based query that holds the definition of prime number. Here’s the SQL:

 

DROP TABLE dbo.Primes

go

CREATE TABLE dbo.Primes

           (Prime int NOT NULL PRIMARY KEY)

go

DECLARE @Start datetime, @End datetime

SET     @Start = CURRENT_TIMESTAMP

DECLARE @Limit int

SET     @Limit = 10000

 

INSERT INTO dbo.Primes (Prime)

SELECT      n1.Number

FROM        dbo.Numbers AS n1

WHERE       n1.Number > 1

AND         n1.Number < @Limit

AND NOT EXISTS

 (SELECT    *

  FROM      dbo.Numbers AS n2

  WHERE     n2.Number > 1

  AND       n2.Number < n1.Number

  AND       n1.Number % n2.Number = 0)

 

SET     @End = CURRENT_TIMESTAMP

SELECT  @Start AS Start_time, @End AS End_time,

        DATEDIFF(ms, @Start, @End) AS Duration,

        COUNT(*) AS Primes_found, @Limit AS Limit

FROM    dbo.Primes

go

--select * from dbo.Primes

go

 

This ran in 1,530 ms on my test system. (And in case you ask – I also tested the equivalent query with LEFT JOIN; that took 11,466 ms, so I quickly discarded it.) With @Limit set to 20,000 and 40,000, execution times were 5,263 and 18,703 ms – so each time we double @Limit, execution time grows by a factor of roughly 3.5. Using this factor, I can estimate an execution time of one to two hours for the full range up to 1,000,000.
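
For those who want to check that estimate, here is the back-of-the-envelope arithmetic (my own extrapolation, not a measurement): going from 40,000 to 1,000,000 means doubling @Limit about log2(25) ≈ 4.6 more times, so

18,703 ms * 3.5 ^ 4.6 ≈ 18,703 ms * 320 ≈ 6,000,000 ms ≈ 1.7 hours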

 

That’s a lot better than the one to two days Ward estimates for his version – but not quite fast enough for me. So I decided to try to convert the Sieve of Eratosthenes to T-SQL. This algorithm is known to be both simple and fast for getting a list of prime numbers. Here’s my first attempt:

 

DROP TABLE dbo.Primes

go

CREATE TABLE dbo.Primes

           (Prime int NOT NULL PRIMARY KEY)

go

DECLARE @Start datetime, @End datetime

SET     @Start = CURRENT_TIMESTAMP

DECLARE @Limit int

SET     @Limit = 10000

 

-- Initial fill of sieve;

-- filter out the even numbers right from the start.

INSERT  INTO dbo.Primes (Prime)

SELECT  Number

FROM    dbo.Numbers

WHERE  (Number % 2 <> 0 OR Number = 2)

AND     Number <> 1

AND     Number <= @Limit

 

-- Set @Current to 2, since multiples of 2 have already been processed

DECLARE @Current int

SET     @Current = 2

WHILE   @Current < SQRT(@Limit)

BEGIN

  -- Find next prime to process

  SET @Current =

             (SELECT TOP (1) Prime

              FROM     dbo.Primes

              WHERE    Prime > @Current

              ORDER BY Prime)

  DELETE FROM dbo.Primes

  WHERE       Prime IN (SELECT n.Number * @Current

                        FROM   dbo.Numbers AS n

                        WHERE  n.Number >= @Current

                        AND    n.Number <= @Limit / @Current)

END

 

SET     @End = CURRENT_TIMESTAMP

SELECT  @Start AS Start_time, @End AS End_time,

        DATEDIFF(ms, @Start, @End) AS Duration,

        COUNT(*) AS Primes_found, @Limit AS Limit

FROM    dbo.Primes

go

--select * from dbo.Primes

 

The time that Eratosthenes takes to find the prime numbers up to 10,000 is 7,750 ms – much longer than the previous version. My only hope was that the execution time would not increase by a factor of 3.5 when doubling @Limit – and indeed, it didn't. With @Limit set to 20,000 and 40,000, execution times were 31,126 and 124,923 ms, so the factor has actually gone up to about 4. With @Limit set to 1,000,000, I expect an execution time of 15 to 20 hours.

 

Time to ditch the sieve? No, not at all. Time to make use of the fact that SQL Server prefers to process whole sets at a time. Let's look at the algorithm in more detail: after the initial INSERT that fills the sieve and removes the multiples of 2, it finds 3 as the next prime number and removes multiples of 3. It starts removing at 9 (3 squared) – so we can be pretty sure that numbers below 9 that have not yet been removed will never be removed. Why process them one at a time? Why not process them all at once? That's what the algorithm below does – on the first pass of the WHILE loop, it takes the last processed number (2), finds the first prime after that (3), then uses the square of that number (9) to define the range of numbers in the sieve that are now guaranteed to be primes. It then removes multiples of all primes in that range. After that, it repeats the operation, this time removing multiples of all primes in the range between 11 (the first prime after 9) and 121 (11 squared). Let's see how this affects performance.

 

DROP TABLE dbo.Primes

go

CREATE TABLE dbo.Primes

           (Prime int NOT NULL PRIMARY KEY)

go

DECLARE @Start datetime, @End datetime

SET     @Start = CURRENT_TIMESTAMP

DECLARE @Limit int

SET     @Limit = 10000

 

-- Initial fill of sieve;

-- filter out the even numbers right from the start.

INSERT  INTO dbo.Primes (Prime)

SELECT  Number

FROM    dbo.Numbers

WHERE  (Number % 2 <> 0 OR Number = 2)

AND     Number <> 1

AND     Number <= @Limit

 

-- Set @Last to 2, since multiples of 2 have already been processed

DECLARE @First int, @Last int

SET     @Last = 2

WHILE   @Last < SQRT(@Limit)

BEGIN

  -- Find next prime as start of next range

  SET @First =

             (SELECT TOP (1) Prime

              FROM     dbo.Primes

              WHERE    Prime > @Last

              ORDER BY Prime)

  -- Range to process ends at square of starting point

  SET @Last = @First * @First

  DELETE FROM dbo.Primes

  WHERE       Prime IN (SELECT     n.Number * p.Prime

                        FROM       dbo.Primes  AS p

                        INNER JOIN dbo.Numbers AS n

                              ON   n.Number >= p.Prime

                              AND  n.Number <= @Limit / p.Prime

                        WHERE      p.Prime  >= @First

                        AND        p.Prime  <  @Last)

END

 

SET     @End = CURRENT_TIMESTAMP

SELECT  @Start AS Start_time, @End AS End_time,

        DATEDIFF(ms, @Start, @End) AS Duration,

        COUNT(*) AS Primes_found, @Limit AS Limit

FROM    dbo.Primes

go

--select * from dbo.Primes

 

The time taken for the 1,229 primes between 1 and 10,000? A mere 266 ms!! With execution times like that, I saw no need to rely on extrapolation – I set @Limit to 1,000,000, hit the Execute button, sat back – and got the following result after 19 seconds:

 

Start_time              End_time                Duration    Primes_found Limit

----------------------- ----------------------- ----------- ------------ -----------

2006-09-24 00:42:22.780 2006-09-24 00:42:41.750 18970       78498        1000000

 

From 1-2 days to just under 20 seconds – and this time, I didn’t even have to add an index!

 

Finally, to top things off, I tried one more thing. I have often read that SQL Server won’t optimize an IN clause as well as an EXISTS clause, especially if the subquery after IN returns a lot of rows – which is definitely the case here. So I rewrote the DELETE statement in the heart of the WHILE loop to read like this:

 

  DELETE FROM dbo.Primes

  WHERE EXISTS

   (SELECT    *

    FROM       dbo.Primes  AS p

    INNER JOIN dbo.Numbers AS n

          ON   n.Number >= p.Prime

          AND  n.Number <= @Limit / p.Prime

    WHERE      p.Prime  >= @First

    AND        p.Prime  <  @Last

    AND        Primes.Prime = n.Number * p.Prime)

 

And here are the results:

 

Start_time              End_time                Duration    Primes_found Limit

----------------------- ----------------------- ----------- ------------ -----------

2006-09-24 00:47:42.797 2006-09-24 00:48:01.903 19106       78498        1000000

 

Which just goes to prove that you shouldn’t believe everything you read, I guess. <g>


10 Comments

  • Denis The SQL Menace
    September 24, 2006 00:19

    Hugo, let me know how many seconds (yes seconds) this version takes

    SET NOCOUNT ON

    DECLARE @i INT

    -- Create a 10-digit table
    DECLARE @D TABLE (N INT)
    INSERT INTO @D (N)
    SELECT 0 UNION ALL
    SELECT 1 UNION ALL
    SELECT 2 UNION ALL
    SELECT 3 UNION ALL
    SELECT 4

    INSERT INTO @D (N)
    SELECT N+5 FROM @D

    -- build a small sieve table between 2 and 1000
    DECLARE @T TABLE (N INT)
    INSERT INTO @T( N )
    SELECT 1+A.N+10*(B.N+10*C.N)
    FROM @D A, @D B, @D C

    DELETE FROM @T WHERE N = 1

    SET @I = 2
    WHILE @I <= SQRT(1000)
    BEGIN
       DELETE FROM @T WHERE N % @I = 0 AND N > @I
       SET @I = @I + 1
    END

    -- Create large table between 1001 and 1000000
    SELECT A+10*(B+10*(C+10*(D+10*(E+10*F)))) AS N
    INTO #P
    FROM
    (    SELECT A.N AS A, B.N AS B, C.N AS C, D.N AS D, E.N AS E, F.N AS F
       FROM @D A, @D B, @D C, @D D, @D E, @D F
       WHERE A.N in (1, 3, 7, 9)  -- Not divisible by 2 or 5
    ) blah
    WHERE (A+B+C+D+E+F) % 3 <> 0 -- Or 3
       AND (A+3*B+2*C-D-3*E-2*F) % 7 <> 0 -- Or 7
       AND (B-A+D-C+F-E) % 11 <> 0 -- Or 11
       AND D|E|F <> 0 -- Don’t include the first 1000 numbers,
    --we already have these in the small sieve table
    UNION ALL SELECT 1000000

    -- sieve the big table with smaller one
    SELECT @I = 2
    WHILE @I IS NOT NULL
    BEGIN
       DELETE FROM #P WHERE N% @I = 0
       SELECT @I = MIN(N) FROM @T WHERE N > @I
    END

    -- add primes up to 1000
    INSERT INTO #P SELECT N FROM @T

    -- Here are the results
    --78498 rows
    SELECT *
    FROM #P ORDER BY 1

    drop table #P
    go

    Reply
  • Six seconds on my laptop, Denis… pretty cool!

    Reply
  • Path Enumeration using Prime number products
    September 25, 2006 08:10
    Reply
  • I was curious about this section of Denis the SQL Menace’s code

    WHERE (A+B+C+D+E+F) % 3 <> 0 -- Or 3
      AND (A+3*B+2*C-D-3*E-2*F) % 7 <> 0 -- Or 7
      AND (B-A+D-C+F-E) % 11 <> 0 -- Or 11

    Is there a mathematical reference for any of the lines?

    Reply
  • Hugo Kornelis
    October 23, 2006 22:36

    Hi Lonedog,

    I understand less than half of it myself, but I think that http://en.wikipedia.org/wiki/Divisibility_rule should be the reference you want.

    I must admit that the (A+B+C+D+E+F) % 3 <> 0 is the only part I understand <g>.

    Reply
  • This is the modulus operator, which returns the remainder of a division by the operand.  So, if one of Denis’ modulus operations returns 0, the calculated expression is evenly divisible by the operand.  This is a very performant way of finding factors.

    Reply
  • What he is doing is what we used to call "digital factoring" (before that came to mean something else): he is using the decimal digits of the number to calculate "digital roots" for various primes, which are shortcut methods for determining the number’s remainder modulo that prime.

    These are tricks that are used by mathemagicians, etc. to do various mental calculations.
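
    To make that concrete: Denis builds each candidate as A + 10*B + 100*C + 1000*D + 10000*E + 100000*F, and since 10 ≡ 3, 100 ≡ 2, 1000 ≡ -1, 10000 ≡ -3 and 100000 ≡ -2 (mod 7), a candidate is divisible by 7 exactly when A + 3*B + 2*C - D - 3*E - 2*F is. If you want to convince yourself, here is a quick sanity check (a hypothetical query of my own, not part of Denis’ script; it assumes the same dbo.Numbers table as the main post):

    -- Compare the digit-based test for 7 with a plain modulo test;
    -- expected result: 0 mismatches for all numbers up to 999,999
    SELECT COUNT(*) AS Mismatches
    FROM   dbo.Numbers
    WHERE  Number <= 999999
    AND    CASE WHEN Number % 7 = 0 THEN 1 ELSE 0 END
        <> CASE WHEN (     (Number % 10)
                     + 3 * (Number / 10     % 10)
                     + 2 * (Number / 100    % 10)
                     -     (Number / 1000   % 10)
                     - 3 * (Number / 10000  % 10)
                     - 2 * (Number / 100000 % 10)) % 7 = 0
                THEN 1 ELSE 0 END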

    Reply
  • Here is what I would use.  It runs about 15-20% faster on my system than Denis’s (of course, I’ve had an extra three years too):

    If Not (object_id('tempdb..#Sieve2') is Null)  Drop table #Sieve2

    If Not (object_id('tempdb..#Sieve3') is Null)  Drop table #Sieve3

    If Not (object_id('tempdb..#Candidates1M') is Null)  Drop table #Candidates1M

    If Not (object_id('tempdb..#Candidates1Ma') is Null)  Drop table #Candidates1Ma

    ;WITH Primes7to36 as (

    Select 7 as p

    Union ALL Select 11

    Union ALL Select 13

    Union ALL Select 17

    Union ALL Select 19

    Union ALL Select 23

    Union ALL Select 29

    Union ALL Select 31)

    Select p

    Into #Sieve2

    From Primes7to36

    ;WITH Base30 as (Select 1 as rem

    Union Select 7 as p

    Union ALL Select 11

    Union ALL Select 13

    Union ALL Select 17

    Union ALL Select 19

    Union ALL Select 23

    Union ALL Select 29)

    , Numbers10E4 as (Select TOP 10000

    ROW_NUMBER() Over(Order by id) as Num

    From master.sys.syscolumns)

    , Numbers34 as (Select Top 34 Num

    From Numbers10E4)

    , Candidates1000 as (

    SELECT rem+30*Num as Cand

    From Base30

     Cross Join Numbers34)

    , Primes7to1000 as (

    Select p From #sieve2 --Primes7to36

     Where p<>7 and p<>11

    UNION ALL

    Select Cand as p

     From Candidates1000

     Where Cand <= 1000

      And Not Exists(Select *

    From #Sieve2
    
    Where (Cand % p)=0))
    

    SELECT p

    Into #Sieve3

    From Primes7to1000

    ;WITH Base30 as (Select 1 as rem

    Union Select 7 as p

    Union ALL Select 11

    Union ALL Select 13

    Union ALL Select 17

    Union ALL Select 19

    Union ALL Select 23

    Union ALL Select 29)

    , Base90 as (Select rem as rem From Base30

    Union ALL Select rem+30 From Base30

    Union ALL Select rem+60 From Base30)

    , Numbers11120 as (Select TOP 11120

    ROW_NUMBER() Over(Order by id) as Num

    From master.sys.syscolumns)

    , Candidates1M as (

    Select rem+(90*Num) as Cand

    From Base90

     Cross Join Numbers11120)

    , Cand2_1M as (

    Select Cand

    From Candidates1M

    Where Cand <= 1000000

     And (Cand % 7) <> 0

     And (Cand % 11) <> 0

    )

    Select Cand

    Into #Candidates1M

    From Cand2_1m

    ;WITH Cand2a as (

    Select Cand as p

    From #Candidates1M

    Where Cand <= 1000000

     And Not Exists(Select *

    From #Sieve3
    
    Where (Cand % p)=0
    
    )
    

    )

    Select p as Cand

    Into #Candidates1Ma

    From Cand2a

    ;WITH FilterPrimes as (

    Select 2 as p

    Union ALL Select 3

    Union ALL Select 5

    Union ALL Select 7

    Union ALL Select 11

    )

    , PrimesLE1M as ( Select p From FilterPrimes

    UNION ALL Select p From #Sieve3

    UNION ALL

    Select Cand as p

     From #Candidates1Ma

    )

    Select p

    From PrimesLE1M

    --=======

    -- RBarryYoung

    Reply
  • Stan Hudecek
    March 7, 2009 13:46

    I also wish to prove my (in)sanity… it takes about 0.7 sec on my notebook (no access to a real server at the moment). I’m trying to do it all in one SQL statement as well. It seemed to run faster with a pre-populated table of ints with an index rather than the row_number, but this seems fairly fast to me.

    select 2 [num] union select 3 union select 5 union

    select num

    from

    (

    select num,min(num%den) [div]

    from

    (

    select a.id [num], b.id [den]
    
    from 
    
    (
    
        select top 10000 row_number() over(order by id) [id]
    
        from master.sys.syscolumns 
    
        --from (select top 40000 c1.id from master.sys.syscolumns c1 cross join master.sys.syscolumns c2) [c]
    
    ) [a]
    
    join 
    
    (
    
        select top 100 row_number() over(order by id) [id]
    
        from master.sys.syscolumns          
    
    ) [b] on (
    
        a.id % 2 <> 0

        and a.id % 3 <> 0

        and a.id % 5 <> 0

        and b.id > 1

        and b.id <= sqrt(a.id)
    
        ) 
    

    ) [base]

    group by num

    having min(num%den) > 0

    ) [cols]

    Reply
  • Hi Hugo,

    I tried your code on a million rows and it drove all 4 processors on my poor little I5 laptop into the stops.  I halted it after 10 minutes.

    The only things I can think of for such a difference between what you got for performance and what I got would be number of CPUs, type of disk, and, of course, how the Numbers table you used was built.  Here’s how I built that.  Please let me know what kind of box you were using and whether or not I correctly duplicated your Numbers table.  Thanks.

    CREATE TABLE dbo.Numbers (Number INT PRIMARY KEY CLUSTERED);

    GO

    INSERT INTO dbo.Numbers WITH(TABLOCK)

           (Number)

    SELECT TOP 5764801

           Number = ROW_NUMBER()OVER(ORDER BY (SELECT NULL))

      FROM sys.all_columns ac1

     CROSS JOIN sys.all_columns ac2

    OPTION (RECOMPILE)

    ;

    Reply
