Coder Perfect

What is the best way to separate a list depending on a condition?

Problem

What’s the best approach to break a list of things into various lists based on a conditional, both aesthetically and in terms of performance? This is the same as:

good = [x for x in mylist if x in goodvals]
bad  = [x for x in mylist if x not in goodvals]

Is there a more graceful way to accomplish this?

To better illustrate what I’m trying to do, here’s the actual use case:

# files looks like: [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi'), ... ]
IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
images = [f for f in files if f[2].lower() in IMAGE_TYPES]
anims  = [f for f in files if f[2].lower() not in IMAGE_TYPES]

Asked by Parand

Solution #1

good, bad = [], []
for x in mylist:
    (bad, good)[x in goodvals].append(x)

Answered by John La Rooy

Solution #2

That code is perfectly readable, and extremely clear!

# files looks like: [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi'), ... ]
IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
images = [f for f in files if f[2].lower() in IMAGE_TYPES]
anims  = [f for f in files if f[2].lower() not in IMAGE_TYPES]

This is, once again, perfectly OK!

There might be slight performance improvements using sets, but it’s a trivial difference, and I find the list comprehension far easier to read, and you don’t have to worry about the order being messed up, duplicates being removed as so on.

In fact, I may go another “backward” step and simply use a for loop:

images, anims = [], []

for f in files:
    if f.lower() in IMAGE_TYPES:
        images.append(f)
    else:
        anims.append(f)

You can use a list comprehension or set() until you need to add another check or logic – for example, if you want to eliminate all 0-byte jpegs, you can add something like.

if f[1] == 0:
    continue

Answered by dbr

Solution #3

Here’s how to use a lazy iterator:

from itertools import tee

def split_on_condition(seq, condition):
    l1, l2 = tee((condition(item), item) for item in seq)
    return (i for p, i in l1 if p), (i for p, i in l2 if not p)

It evaluates the condition once per item and produces two generators, the first of which yields values from the series where the condition is true and the other of which yields values from the sequence where the condition is false.

You may use it on any iterator, even an endless one, because it’s lazy:

from itertools import count, islice

def is_prime(n):
    return n > 1 and all(n % i for i in xrange(2, n))

primes, not_primes = split_on_condition(count(), is_prime)
print("First 10 primes", list(islice(primes, 10)))
print("First 10 non-primes", list(islice(not_primes, 10)))

The non-lazy list returning approach, on the other hand, is usually preferable:

def split_on_condition(seq, condition):
    a, b = [], []
    for item in seq:
        (a if condition(item) else b).append(item)
    return a, b

Edit: Here’s a general function that performs the same thing for your more particular usecase of separating items into distinct lists based on a key:

DROP_VALUE = lambda _:_
def split_by_key(seq, resultmapping, keyfunc, default=DROP_VALUE):
    """Split a sequence into lists based on a key function.

        seq - input sequence
        resultmapping - a dictionary that maps from target lists to keys that go to that list
        keyfunc - function to calculate the key of an input value
        default - the target where items that don't have a corresponding key go, by default they are dropped
    """
    result_lists = dict((key, []) for key in resultmapping)
    appenders = dict((key, result_lists[target].append) for target, keys in resultmapping.items() for key in keys)

    if default is not DROP_VALUE:
        result_lists.setdefault(default, [])
        default_action = result_lists[default].append
    else:
        default_action = DROP_VALUE

    for item in seq:
        appenders.get(keyfunc(item), default_action)(item)

    return result_lists

Usage:

def file_extension(f):
    return f[2].lower()

split_files = split_by_key(files, {'images': IMAGE_TYPES}, keyfunc=file_extension, default='anims')
print split_files['images']
print split_files['anims']

Answered by Ants Aasma

Solution #4

The problem with all of the recommended solutions is that they will scan and filter twice. I’d create a little function that looks like this:

def split_into_two_lists(lst, f):
    a = []
    b = []
    for elem in lst:
        if f(elem):
            a.append(elem)
        else:
            b.append(elem)
    return a, b

You won’t have to process anything twice, and you won’t have to repeat code.

Answered by winden

Solution #5

This is my opinion on it. I propose a lazy, single-pass, partition function, which preserves relative order in the output subsequences.

I’m guessing the prerequisites are as follows:

My partition function (introduced below) and other similar functions have made it into a small library:

It’s generally installed using PyPI:

pip install --user split

Use the partition function to split a list based on a condition:

>>> from split import partition
>>> files = [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi') ]
>>> image_types = ('.jpg','.jpeg','.gif','.bmp','.png')
>>> images, other = partition(lambda f: f[-1] in image_types, files)
>>> list(images)
[('file1.jpg', 33L, '.jpg')]
>>> list(other)
[('file2.avi', 999L, '.avi')]

We need to create two subsequences at the same time internally, therefore consuming simply one output sequence will require the other to be computed as well. In addition, we must maintain state between user requests (store processed but not yet requested elements). I utilize two double-ended queues (deques) to keep track of state:

from collections import deque

The housekeeping is taken care of by the SplitSeq class:

class SplitSeq:
    def __init__(self, condition, sequence):
        self.cond = condition
        self.goods = deque([])
        self.bads = deque([])
        self.seq = iter(sequence)

Its.getNext() method is where the magic happens. It’s similar to iterators’.next(), except it lets us select which type of element we want this time. It doesn’t trash the rejected elements behind the scenes, instead placing them in one of two queues:

    def getNext(self, getGood=True):
        if getGood:
            these, those, cond = self.goods, self.bads, self.cond
        else:
            these, those, cond = self.bads, self.goods, lambda x: not self.cond(x)
        if these:
            return these.popleft()
        else:
            while 1: # exit on StopIteration
                n = self.seq.next()
                if cond(n):
                    return n
                else:
                    those.append(n)

The partition function is designed to be used by the end user. It returns two generators after taking a condition function and a sequence (much like map or filter). The first generator creates a subsequence of components that satisfy the requirement, while the second creates a complementary subsequence. Lazy splitting of even long or endless sequences is possible with iterators and generators.

def partition(condition, sequence):
    cond = condition if condition else bool  # evaluate as bool if condition == None
    ss = SplitSeq(cond, sequence)
    def goods():
        while 1:
            yield ss.getNext(getGood=True)
    def bads():
        while 1:
            yield ss.getNext(getGood=False)
    return goods(), bads()

To make partial application easier in the future, I made the test function the first argument (similar to how map and filter have the test function as the first argument).

Answered by sastanin

Post is based on https://stackoverflow.com/questions/949098/how-to-split-a-list-based-on-a-condition