Problem
How can I use SQLAlchemy to select one or more random rows from a table?
Asked by cnu
Solution #1
This is primarily a database-related issue.
You can use this with SQLAlchemy because PostgreSQL, SQLite, MySQL, and Oracle all offer the ability to order by a random function:
from sqlalchemy.sql.expression import func, select
select.order_by(func.random()) # for PostgreSQL, SQLite
select.order_by(func.rand()) # for MySQL
select.order_by('dbms_random.value') # For Oracle
The query must then be limited by the number of records you require (for example, by using.limit()).
Keep in mind that picking random records, at least in PostgreSQL, has serious performance concerns; here’s a nice essay on the subject.
Answered by Łukasz
Solution #2
If you’re using the orm and your table isn’t too big (or you’ve already cached the number of rows), and you want it to be database independent, there’s a fairly simple way to do it.
import random
rand = random.randrange(0, session.query(Table).count())
row = session.query(Table)[rand]
This is a little bit of cheating, but that’s why you use an orm.
Answered by David Raznick
Solution #3
There is a simple database-independent approach to pull a random record. Simply use the.offset() method. There’s no need to remove all the rows:
import random
query = DBSession.query(Table)
rowCount = int(query.count())
randomRow = query.offset(int(rowCount*random.random())).first()
What is the location of your table? (or you could put any query there). If you want a few rows, simply repeat the process, making sure that each row is distinct from the one before it.
Answered by GuySoft
Solution #4
From slowest to fastest, here are four alternative variants. At the bottom of the page, you’ll find the timeit results:
from sqlalchemy.sql import func
from sqlalchemy.orm import load_only
def simple_random():
return random.choice(model_name.query.all())
def load_only_random():
return random.choice(model_name.query.options(load_only('id')).all())
def order_by_random():
return model_name.query.order_by(func.random()).first()
def optimized_random():
return model_name.query.options(load_only('id')).offset(
func.floor(
func.random() *
db.session.query(func.count(model_name.id))
)
).limit(1).all()
The following are the results of 10,000 timeit runs on my Macbook against a 300-row PostgreSQL table:
simple_random():
90.09954111799925
load_only_random():
65.94714171699889
order_by_random():
23.17819356000109
optimized_random():
19.87806927999918
It’s clear that using func.random() is far faster than using Python’s random.choice() method to return all results ().
Furthermore, as the table size grows, the efficiency of order by random() degrades dramatically since ORDER BY requires a full database search whereas COUNT in optimized random() can use an index.
Answered by Jeff Widman
Solution #5
Some SQL DBMS, namely Microsoft SQL Server, DB2, and PostgreSQL have implemented the SQL:2003 TABLESAMPLE clause. Support was added to SQLAlchemy in version 1.1. It allows returning a sample of a table using different sampling methods – the standard requires SYSTEM and BERNOULLI, which return a desired approximate percentage of a table.
To create a TableSample construct in SQLAlchemy, use FromClause.tablesample() and tablesample():
# Approx. 1%, using SYSTEM method
sample1 = mytable.tablesample(1)
# Approx. 1%, using BERNOULLI method
sample2 = mytable.tablesample(func.bernoulli(1))
When using mapped classes, there is one caveat: the resulting TableSample object must be aliased before it can be used to query model objects:
sample = aliased(MyModel, tablesample(MyModel, 1))
res = session.query(sample).all()
Since many of the answers contain performance benchmarks, I’ll include some simple tests here as well. Using a simple table in PostgreSQL with about a million rows and a single integer column, select (approx.) 1% sample:
In [24]: %%timeit
...: foo.select().\
...: order_by(func.random()).\
...: limit(select([func.round(func.count() * 0.01)]).
...: select_from(foo).
...: as_scalar()).\
...: execute().\
...: fetchall()
...:
307 ms ± 5.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [25]: %timeit foo.tablesample(1).select().execute().fetchall()
6.36 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit foo.tablesample(func.bernoulli(1)).select().execute().fetchall()
19.8 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Before rushing to utilise the SYSTEM sampling method, keep in mind that it samples pages rather than individual tuples, therefore it may not be appropriate for tiny tables and may not yield as random results if the table is clustered.
If you’re using a dialect that doesn’t let you supply the sample %, number of rows, or seed as parameters, and you’re using a driver that doesn’t let you inline values, pass the values as literal SQL text if they’re static, or use a custom SQLA compiler extension to inline them:
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql import TableSample
@compiles(TableSample)
def visit_tablesample(tablesample, self, asfrom=False, **kw):
""" Compile `TableSample` with values inlined.
"""
kw_literal_binds = {**kw, "literal_binds": True}
text = "%s TABLESAMPLE %s" % (
self.visit_alias(tablesample, asfrom=True, **kw),
tablesample._get_method()._compiler_dispatch(self, **kw_literal_binds),
)
if tablesample.seed is not None:
text += " REPEATABLE (%s)" % (
tablesample.seed._compiler_dispatch(self, **kw_literal_binds)
)
return text
from sqlalchemy import table, literal, text
# Static percentage
print(table("tbl").tablesample(text("5 PERCENT")))
# Compiler inlined values
print(table("tbl").tablesample(5, seed=literal(42)))
Answered by Ilja Everilä
Post is based on https://stackoverflow.com/questions/60805/getting-random-row-through-sqlalchemy