Coder Perfect

Select n random rows from SQL Server table

Problem

I have a SQL Server table with approximately 50,000 rows. I’d like to choose 5,000 of those rows at random. I’ve considered a more involved approach, such as creating a temp table with a “random number” column, copying my table into it, looping over the temp table and updating each row with RAND(), and then picking rows where the random number column is less than 0.1. I’m searching for an easier way to do that, preferably in a single statement.

This article recommends the NEWID() function. That looks promising, but I can’t see how I could reliably select a specific percentage of rows.

Has anyone tried this before? Do you have any suggestions?

Asked by John M Gant

Solution #1

select top 10 percent * from [yourtable] order by newid()

In response to the “pure trash” comment about large tables: you could do it this way to improve performance.

select  * from [yourtable] where [yourPk] in 
(select top 10 percent [yourPk] from [yourtable] order by newid())

The cost will be the key scan of values plus the join cost, which should be affordable on a large table with a small percentage selection.
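Since the question asks for 5,000 rows rather than a percentage, the same two queries can be written with a fixed row count. This is only a sketch: [yourtable] and [yourPk] are the placeholders from the answer above, and TOP (n) with parentheses assumes SQL Server 2005 or later.

```sql
-- Exactly 5,000 rows, chosen by sorting every row on a fresh GUID.
SELECT TOP (5000) *
FROM [yourtable]
ORDER BY NEWID();

-- Same idea, but only the key column is sorted;
-- the full rows are fetched afterwards by key.
SELECT *
FROM [yourtable]
WHERE [yourPk] IN (
    SELECT TOP (5000) [yourPk]
    FROM [yourtable]
    ORDER BY NEWID()
);
```

Both forms return exactly 5,000 rows as long as the table has at least that many.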

Answered by Ralph Shillington

Solution #2

Depending on your needs, TABLESAMPLE will give you a nearly as random sample with better performance. This feature is available in Microsoft SQL Server 2005 and later versions.

TABLESAMPLE returns data from random pages rather than random rows, so it does not even read data that it will not return.

I tested this on a very large table.

select top 1 percent * from [tablename] order by newid()

took more than 20 minutes.

select * from [tablename] tablesample(1 percent)

took 2 minutes.

TABLESAMPLE’s performance will also improve on smaller samples, whereas newid()’s will not.

Please note that this approach is not as random as the newid() method, but it will provide a good sample.

See the MSDN page for further information.
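For the 5,000-row case in the question, TABLESAMPLE also accepts a row count, and a seed can be supplied to make a sample repeatable. The following is a rough sketch, with [tablename] as the placeholder used above; the seed value 42 and the 20 percent oversample are arbitrary choices made for illustration.

```sql
-- Roughly 5,000 rows: the row count is converted to a percentage
-- of pages, so the number actually returned varies from run to run.
SELECT *
FROM [tablename] TABLESAMPLE (5000 ROWS);

-- Same sample on every run, as long as the table has not changed.
SELECT *
FROM [tablename] TABLESAMPLE (10 PERCENT) REPEATABLE (42);

-- One way to get an exact count: oversample by page, then trim
-- the smaller set with TOP ... ORDER BY NEWID().
SELECT TOP (5000) *
FROM [tablename] TABLESAMPLE (20 PERCENT)
ORDER BY NEWID();
```

The last query only sorts the sampled pages rather than the whole table, but it can still return fewer than 5,000 rows if the page sample happens to come up short.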

Answered by Patrick Taylor

Solution #3

Because it must generate an id for each row and then sort them, newid()/order by will be exceedingly expensive for huge result sets.

TABLESAMPLE() is fast, but it causes results to clump together (all rows on a page will be returned).

The easiest way to get a truly random sample is to filter rows out randomly. I found a code sample for this in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE; the “Filter” query below is adapted from it.

Here are my results when applied to a table with 1,000,000 rows:

SET STATISTICS TIME ON
SET STATISTICS IO ON

/* newid()
   rows returned: 10000
   logical reads: 3359
   CPU time: 3312 ms
   elapsed time: 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()

/* TABLESAMPLE
   rows returned: 9269 (varies)
   logical reads: 32
   CPU time: 0 ms
   elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)

/* Filter
   rows returned: 9994 (varies)
   logical reads: 3359
   CPU time: 641 ms
   elapsed time: 627 ms
*/    
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
              / CAST (0x7fffffff AS int)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise, use the newid()/filter method. If you have a large result set, newid()/order by should be your last resort.
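The filter query can be adapted to other sampling rates by pulling the threshold out into a variable. A minimal sketch, reusing the Numbers table from the test above; the @fraction variable is a name introduced here for illustration only.

```sql
-- Hypothetical parameter: the fraction of rows to keep (0.1 = about 10 percent).
DECLARE @fraction float;
SET @fraction = 0.1;

-- CHECKSUM(NEWID(), Number) gives a fresh pseudo-random int for each row;
-- masking with 0x7fffffff clears the sign bit, and dividing by the same
-- value scales the result into the range 0 to 1.
SELECT Number
FROM Numbers
WHERE @fraction >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
                   / CAST(0x7fffffff AS int);
```

Each row is kept or dropped independently, so the row count is only approximately @fraction times the table size, which matches the “varies” note in the results above.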

Answered by Rob Boek

Solution #4

MSDN has a simple, well-articulated approach for selecting rows randomly from a large table that addresses the large-scale performance concerns.

  SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

Answered by Kyle McClellan

Solution #5

For tables with 1, 7, and 13 million rows, this link has an interesting comparison between ORDER BY NEWID() and other approaches.

The NEWID query is often proposed in discussion groups when people ask how to select random rows; it is simple and works well for small tables.

SELECT TOP 10 PERCENT *
  FROM Table1
  ORDER BY NEWID()

On large tables, however, the NEWID query has a significant drawback. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This leads to two issues: the sort is expensive in disk I/O and can run for a long time, and tempdb can take up a large amount of disk space or, in the worst case, run out of it.

What you need is a way to select rows at random that does not use tempdb and does not get much slower as the table grows. Here’s a new idea for how to do that:

SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

The primary idea behind this query is to generate a random number between 0 and 99 for each row in the table, then select those rows whose random number is less than the specified percent value. In this case, we want about 10% of the rows to be chosen at random, therefore we choose all of the rows with a random number less than 10.
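One thing to keep in mind is that RAND() with no seed is evaluated once per statement rather than once per row, so the row-to-row variation in the query above comes from BINARY_CHECKSUM(*). A possible variant, sketched here under assumptions, puts NEWID() inside the checksum so that every row gets its own random bucket. The names keycol1 and @percent are hypothetical, and masking with 0x7fffffff is used in place of ABS() to keep the value non-negative.

```sql
-- Hypothetical parameter: the approximate percentage of rows to keep.
DECLARE @percent int;
SET @percent = 10;

-- BINARY_CHECKSUM(keycol1, NEWID()) changes per row and per run;
-- the mask clears the sign bit, and % 100 maps the value to a bucket 0..99.
SELECT *
FROM Table1
WHERE (BINARY_CHECKSUM(keycol1, NEWID()) & 0x7fffffff) % 100 < @percent;
```

As with the original query, this returns about 10 percent of the rows rather than an exact count.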

Please read the entire MSDN article.

Answered by RJardines

Post is based on https://stackoverflow.com/questions/848872/select-n-random-rows-from-sql-server-table