Problem
I have a SQL Server table with approximately 50,000 rows. I’d like to choose 5,000 of those rows at random. I’ve considered a more involved approach, such as creating a temp table with a “random number” column, copying my table into it, looping over the temp table and updating each row with RAND(), and then picking rows where the random number column is less than 0.1. I’m searching for an easier way to do that, preferably in a single statement.
This article recommends the NEWID() function. That looks promising, but I don’t see how I could reliably select a specific percentage of rows with it.
Has anyone tried this before? Do you have any suggestions?
Asked by John M Gant
Solution #1
select top 10 percent * from [yourtable] order by newid()
In response to the comment that this approach is “pure trash” on huge tables, you can do it this way to improve performance:
select * from [yourtable] where [yourPk] in
(select top 10 percent [yourPk] from [yourtable] order by newid())
The cost will be the key scan of values plus the join cost, which should be affordable on a large table with a small percentage selection.
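Since the original question asks for 5,000 rows out of 50,000 rather than a percentage, here is a hedged sketch of the same idea with a fixed row count ([yourtable] and [yourPk] are the placeholder names from this answer):

```sql
-- Sketch only: TOP with a row count instead of PERCENT returns an exact
-- number of rows, chosen at random by the newid() sort in the subquery.
select * from [yourtable] where [yourPk] in
(select top 5000 [yourPk] from [yourtable] order by newid())
```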
Answered by Ralph Shillington
Solution #2
TABLESAMPLE will give you an almost-as-random sample with better performance, depending on your needs. This feature is available in Microsoft SQL Server 2005 and later versions.
TABLESAMPLE returns data from random pages rather than random rows, so it does not even retrieve data that it will not return.
I tested this on a fairly large table:
select top 1 percent * from [tablename] order by newid()
took more than 20 minutes;
select * from [tablename] tablesample(1 percent)
took 2 minutes.
On smaller sample sizes, TABLESAMPLE’s performance improves further, while newid()’s does not.
Please note that this approach is not as random as the newid() method, but it will provide a good sample.
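Applied to the original 50,000-row table, this might look as follows (a sketch only: [yourtable] is a placeholder, and because TABLESAMPLE samples whole pages, the row count is approximate rather than exactly 5,000):

```sql
-- Returns roughly 10 percent of the rows, drawn from random pages.
select * from [yourtable] tablesample (10 percent)

-- Add REPEATABLE(seed) if you need the same sample across runs
-- (as long as the data has not changed):
select * from [yourtable] tablesample (10 percent) repeatable (42)
```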
See the MSDN page for further information.
Answered by Patrick Taylor
Solution #3
Because it must generate an id for each row and then sort them all, newid()/order by is extremely expensive for large result sets.
TABLESAMPLE() is fast, but it causes results to clump together (all rows on a page are returned).
The best way to get a truly random sample with good performance is to filter rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:
Here are my results when applied to a table with 1,000,000 rows:
SET STATISTICS TIME ON
SET STATISTICS IO ON
/* newid()
rows returned: 10000
logical reads: 3359
CPU time: 3312 ms
elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()
/* TABLESAMPLE
rows returned: 9269 (varies)
logical reads: 32
CPU time: 0 ms
elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)
/* Filter
rows returned: 9994 (varies)
logical reads: 3359
CPU time: 641 ms
elapsed time: 627 ms
*/
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
TABLESAMPLE will give you the best performance if you can get away with it. Otherwise, use the newid()/filter method. newid()/order by should be your last resort if you have a large result set.
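If an exact row count is required, one possible combination (an assumption on my part, not from the original answer) is to oversample with TABLESAMPLE and then trim to size with newid(); Numbers is the 1,000,000-row table used in the tests above, so TOP 10000 is about 1 percent:

```sql
-- Sketch: oversample pages with TABLESAMPLE, then trim to an exact count.
SELECT TOP 10000 Number
FROM Numbers
TABLESAMPLE (2 PERCENT)   -- oversample so enough rows survive the trim
ORDER BY NEWID()          -- randomize within the sampled pages
```

Note that the result is still biased toward whole pages; the ORDER BY NEWID() only randomizes among the rows TABLESAMPLE happened to read.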
Answered by Rob Boek
Solution #4
On MSDN, there’s a straightforward, well-articulated approach for selecting rows randomly from a huge table that addresses the large-scale performance concerns.
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
Answered by Kyle McClellan
Solution #5
For tables with 1, 7, and 13 million rows, the linked article has an interesting comparison of ORDER BY NEWID() against other approaches.
The NEWID query is frequently suggested in discussion groups when questions about selecting random rows come up; it is simple and works well for small tables.
SELECT TOP 10 PERCENT *
FROM Table1
ORDER BY NEWID()
When it comes to huge tables, however, the NEWID query has a significant disadvantage: all of the rows in the table are copied into the tempdb database and sorted by the ORDER BY clause. This leads to two issues: the sort itself is expensive, and the query keeps getting slower as the table grows, because the entire table must be copied and sorted on every execution.
What you need is a method for selecting rows at random that does not rely on tempdb and does not become significantly slower as the database grows in size. Here’s a fresh way to go about it:
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
The primary idea behind this query is to generate a random number between 0 and 99 for each row in the table, then select those rows whose random number is less than the specified percent value. In this case, we want about 10% of the rows chosen at random, so we select all rows with a random number less than 10. Note that RAND() is evaluated only once per query; the per-row variation comes from BINARY_CHECKSUM(*), which produces a different value for each distinct row.
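The percentage can be parameterized; here is a sketch where @Percent is a hypothetical variable not part of the MSDN sample (Table1 is the placeholder table name from the query above):

```sql
-- Sketch: choose roughly @Percent percent of rows at random.
DECLARE @Percent int = 10;
SELECT * FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS int)) % 100) < @Percent;
```

Because the filter is probabilistic, the returned row count varies around the target percentage rather than matching it exactly.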
Please read the entire MSDN article.
Answered by RJardines
Post is based on https://stackoverflow.com/questions/848872/select-n-random-rows-from-sql-server-table