Problem
I’ve been experimenting with Redis and MongoDB lately, and storing an array of ids in either seems to be a common scenario. Since I’m asking about the MySQL IN operator, I’ll stick with Redis for this question.
I was curious about the performance of listing a large number of ids (300-3000) inside the IN operator, which would look like this:
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
Consider simple products and categories tables, which you would ordinarily JOIN to get the products of a given category. In the example above, I instead fetch all the product ids of category 4 from Redis (category:4:product_ids) and place them inside the IN operator of the SELECT query.
How effective is this?
Is this a case of “it depends”? Or is there a definite “this is (un)acceptable”, “fast”, or “slow” answer? Should I add a LIMIT 25, or does that not help?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
LIMIT 25
Or should I trim the array of product ids returned by Redis to 25 and pass only those 25 ids to the query, rather than passing 3000 ids and LIMIT-ing to 25 from within the query?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 25)
Any and all suggestions/feedback are greatly appreciated!
Asked by Michael van Rooijen
Solution #1
In general, if the IN list gets too large (for some ill-defined value of “too large”, usually in the region of 100 or smaller), it becomes more efficient to use a join, creating a temporary table to hold the numbers if necessary.
If the numbers are dense (no gaps), as the sample data implies, then WHERE id BETWEEN 300 AND 3000 will work even better.
However, there are probably gaps in the set, in which case it may be better to stick with the list of valid values (unless the gaps are few, in which case you could use:
WHERE id BETWEEN 300 AND 3000 AND id NOT BETWEEN 742 AND 836
or whatever the gaps are).
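The temporary-table variant of this advice can be sketched in SQL. The table and column names here are assumptions for illustration, not part of the original answer:

```sql
-- Stage the ids (e.g. the ones fetched from Redis) in a temporary table.
CREATE TEMPORARY TABLE tmp_product_ids (
    id INT NOT NULL,
    PRIMARY KEY (id)
);

-- Bulk-insert the ids; in practice the client would batch this.
INSERT INTO tmp_product_ids (id) VALUES (1), (2), (3); -- ... up to 3000

-- Join against the staged ids instead of building a huge IN list.
SELECT p.id, p.name, p.price
FROM products AS p
INNER JOIN tmp_product_ids AS t ON t.id = p.id;
```

The temporary table is visible only to the current connection and is dropped automatically when the connection closes.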
Answered by Jonathan Leffler
Solution #2
I’ve been running some tests, and as David Fells points out in his answer, IN appears to be well optimized. As an example, I created an InnoDB table with 1,000,000 rows: selecting 500,000 random values with the IN operator took only 2.5 seconds on my Mac, and selecting the rows with odd ids took 0.5 seconds.
One thing to keep in mind is that I had to increase the max_allowed_packet parameter in the my.cnf file; otherwise you get a mysterious “MySQL server has gone away” error.
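For reference, the setting lives in the [mysqld] section of my.cnf; the 64M value below is just an example, so size it to fit your largest statement:

```ini
[mysqld]
# Must be large enough to hold the full statement,
# including a multi-thousand-value IN list.
max_allowed_packet = 64M
```

The server must be restarted for the change to take effect.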
The PHP code I used to create the test is as follows:
$NROWS = 1000000;
$SELECTED = 50;
$NROWSINSERT = 15000;

$dsn = "mysql:host=localhost;port=8889;dbname=testschema";
$pdo = new PDO($dsn, "root", "root");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec("drop table if exists `uniclau`.`testtable`");
$pdo->exec("CREATE TABLE `testtable` (
    `id` INT NOT NULL,
    `text` VARCHAR(45) NULL,
    PRIMARY KEY (`id`))");

$before = microtime(true);

$Values = '';
$SelValues = '(';
$c = 0;
for ($i = 0; $i < $NROWS; $i++) {
    $r = rand(0, 99);
    if ($c > 0) $Values .= ",";
    $Values .= "( $i , 'This is value $i and r= $r')";
    // Roughly half the ids end up in the IN list ($r < 50).
    if ($r < $SELECTED) {
        if ($SelValues != "(") $SelValues .= ",";
        $SelValues .= $i;
    }
    $c++;
    // Insert in batches of 100 rows.
    if (($c == 100) || (($i == $NROWS - 1) && ($c > 0))) {
        $pdo->exec("INSERT INTO `testtable` VALUES $Values");
        $Values = "";
        $c = 0;
    }
}
$SelValues .= ')';
echo "<br>";

$after = microtime(true);
echo "Insert execution time = " . ($after - $before) . "s<br>";

$before = microtime(true);
$sql = "SELECT count(*) FROM `testtable` WHERE id IN $SelValues";
$result = $pdo->prepare($sql);
$after = microtime(true);
echo "Prepare execution time = " . ($after - $before) . "s<br>";

$before = microtime(true);
$result->execute();
$c = $result->fetchColumn();
$after = microtime(true);
echo "Random selection = $c Execution time = " . ($after - $before) . "s<br>";

$before = microtime(true);
$sql = "SELECT count(*) FROM `testtable` WHERE id % 2 = 1";
$result = $pdo->prepare($sql);
$result->execute();
$c = $result->fetchColumn();
$after = microtime(true);
echo "Pairs = $c Execution time = " . ($after - $before) . "s<br>";
And the results:
Insert execution time = 35.2927210331s
Prepare execution time = 0.0161771774292s
Random selection = 499102 Execution time = 2.40285992622s
Pairs = 500000 Execution time = 0.465420007706s
Answered by jbaylina
Solution #3
You can create a temporary table holding any number of IDs and then run a nested query against it. Example:
CREATE [TEMPORARY] TABLE tmp_IDs (`ID` INT NOT NULL, PRIMARY KEY (`ID`));
and select:
SELECT id, name, price
FROM products
WHERE id IN (SELECT ID FROM tmp_IDs);
Answered by Vladimir Jotov
Solution #4
Using IN with a large parameter set on a table with many records will be slow.
In a case I solved recently, I had two WHERE clauses, one with 250 parameters and the other with 3,500 parameters, querying a table of 40 million records.
Using a plain WHERE IN, my query took 5 minutes to complete. By using a subquery for the IN statement (and putting the parameters in their own indexed table), I cut the query time in half.
For what it’s worth, this worked for me in both MySQL and Oracle.
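The indexed-table-plus-subquery approach described above can be sketched as follows; the table and column names are hypothetical, not the poster’s actual schema:

```sql
-- Load the parameter set into its own indexed table once.
CREATE TABLE query_params (
    param INT NOT NULL,
    PRIMARY KEY (param)
);

INSERT INTO query_params (param) VALUES (101), (102), (103); -- etc.

-- The IN subquery can now use the index on query_params
-- instead of scanning a literal list of thousands of values.
SELECT *
FROM big_table
WHERE some_col IN (SELECT param FROM query_params);
```

This also keeps the statement itself short, which avoids the max_allowed_packet issue mentioned in Solution #2.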
Answered by yoyodunno
Solution #5
IN is fine and well optimized. As long as you use it on an indexed field, you’ll be alright.
Functionally, it’s equivalent to
(x = 1 OR x = 2 OR x = 3 ... OR x = 99)
as far as the database engine is concerned.
EDIT: Keep in mind that this answer was written in 2011; the comments on the original answer discuss more recent MySQL features.
Answered by David Fells
Post is based on https://stackoverflow.com/questions/4514697/mysql-in-operator-performance-on-large-number-of-values