Problem
The documentation demonstrates how to use a dict with the output column names as keys to apply multiple functions on a groupby object at once:
In [563]: grouped['D'].agg({'result1' : np.sum,
.....: 'result2' : np.mean})
.....:
Out[563]:
result2 result1
A
bar -0.579846 -1.739537
foo -0.280588 -1.402938
This, however, only applies to a Series groupby object. When a dict is supplied to a groupby DataFrame in the same way, the keys are expected to represent the column names to which the function will be applied.
I’d like to apply various functions to a number of columns (but certain columns will be operated on multiple times). In addition, some functions in the groupby object will be dependent on other columns (like sumif functions). My current method is to go column by column and use lambdas for functions that are dependent on other rows, similar to the code above. However, it is taking a long time (I think it takes a long time to iterate through a groupby object). I’ll have to tweak it so that I iterate through the entire groupby object in a single pass, but I’m wondering if there’s a built-in mechanism to do this in pandas in a tidy way.
For instance, I’ve attempted something similar to
grouped.agg({'C_sum' : lambda x: x['C'].sum(),
'C_std': lambda x: x['C'].std(),
'D_sum' : lambda x: x['D'].sum()},
'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...)
However, I get a KeyError as predicted (since the keys have to be a column if agg is called from a DataFrame).
Is there any built in way to do what I’d like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?
Asked by beardc
Solution #1
The present acceptable answer’s second half is out of date and contains two deprecations. First and foremost, you can no longer use the agg groupby method to pass a dictionary of dictionaries. Second,.ix should never be used.
If you want to work with two different columns at the same time, you should use the apply method, which passes a DataFrame to the applied function implicitly. Let’s utilize a dataframe that’s comparable to the one we used earlier.
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.418500 0.030955 0.874869 0.145641 0
1 0.446069 0.901153 0.095052 0.487040 0
2 0.843026 0.936169 0.926090 0.041722 1
3 0.635846 0.439175 0.828787 0.714123 1
An aggregate can still be performed using a dictionary that maps column names to aggregation functions.
df.groupby('group').agg({'a':['sum', 'max'],
'b':'mean',
'c':'sum',
'd': lambda x: x.max() - x.min()})
a b c d
sum max mean sum <lambda>
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
If you don’t like the name of the lambda column, you can use a regular function and give it a custom name using the special __name__ property, as seen below:
def max_min(x):
return x.max() - x.min()
max_min.__name__ = 'Max minus Min'
df.groupby('group').agg({'a':['sum', 'max'],
'b':'mean',
'c':'sum',
'd': max_min})
a b c d
sum max mean sum Max minus Min
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
Now, if you had multiple columns that needed to interact together then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the complete group is supplied into the method as a DataFrame.
Making a single custom function that returns a Series of all the aggregations is what I advocate. As labels for the new columns, use the Series index:
def f(x):
d = {}
d['a_sum'] = x['a'].sum()
d['a_max'] = x['a'].max()
d['b_mean'] = x['b'].mean()
d['c_d_prodsum'] = (x['c'] * x['d']).sum()
return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])
df.groupby('group').apply(f)
a_sum a_max b_mean c_d_prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
If you’re a big fan of MultiIndexes, you can still use one like this to return a Series:
def f_mi(x):
d = []
d.append(x['a'].sum())
d.append(x['a'].max())
d.append(x['b'].mean())
d.append((x['c'] * x['d']).sum())
return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
['sum', 'max', 'mean', 'prodsum']])
df.groupby('group').apply(f_mi)
a b c_d
sum max mean prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
Answered by Ted Petrou
Solution #2
You can pass a dict of column names for keys and a list of functions for values in the first part:
In [28]: df
Out[28]:
A B C D E GRP
0 0.395670 0.219560 0.600644 0.613445 0.242893 0
1 0.323911 0.464584 0.107215 0.204072 0.927325 0
2 0.321358 0.076037 0.166946 0.439661 0.914612 1
3 0.133466 0.447946 0.014815 0.130781 0.268290 1
In [26]: f = {'A':['sum','mean'], 'B':['prod']}
In [27]: df.groupby('GRP').agg(f)
Out[27]:
A B
sum mean prod
GRP
0 0.719580 0.359790 0.102004
1 0.454824 0.227412 0.034060
UPDATE 1:
Because the aggregate function is based on Series, references to the names of the other columns are lost. To get around this, you can use the lambda function to reference the entire dataframe and index it using the group indices.
Here’s an ingenious workaround:
In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}
In [69]: df.groupby('GRP').agg(f)
Out[69]:
A B D
sum mean prod <lambda>
GRP
0 0.719580 0.359790 0.102004 1.170219
1 0.454824 0.227412 0.034060 1.182901
The resulting ‘D’ column contains the total of the ‘E’ values.
UPDATE 2:
Here’s a way that, in my opinion, will fulfill all of your requirements. Make a custom lambda function first. The letter g refers to the group. g will be a Series when aggregating. The current group from df is selected when g.index is sent to df.ix[]. After that, I check to see if column C is less than 0.5. The boolean series that is returned is sent to g[], which chooses just the rows that fulfill the criterion.
In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()
In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}
In [97]: df.groupby('GRP').agg(f)
Out[97]:
A B D
sum mean prod my name
GRP
0 0.719580 0.359790 0.102004 0.204072
1 0.454824 0.227412 0.034060 0.570441
Answered by Zelazny7
Solution #3
We’ve been moving away from dictionary-based aggregation and renaming in favor of named aggregations that accept a tuple from pandas version 0.25.0 or higher. Now we can aggregate and rename the column to something more descriptive:
Example:
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
a b c d group
0 0.521279 0.914988 0.054057 0.125668 0
1 0.426058 0.828890 0.784093 0.446211 0
2 0.363136 0.843751 0.184967 0.467351 1
3 0.241012 0.470053 0.358018 0.525032 1
With named aggregation, use GroupBy.agg:
df.groupby('group').agg(
a_sum=('a', 'sum'),
a_mean=('a', 'mean'),
b_mean=('b', 'mean'),
c_sum=('c', 'sum'),
d_range=('d', lambda x: x.max() - x.min())
)
a_sum a_mean b_mean c_sum d_range
group
0 0.947337 0.473668 0.871939 0.838150 0.320543
1 0.604149 0.302074 0.656902 0.542985 0.057681
Answered by Erfan
Solution #4
I found that a little more condensed listing was preferable to Ted Petrou’s answer (primarily for aesthetic reasons). Please don’t accept it; it’s just a lot more extensive comment on Ted’s response, as well as code and data. Although Python/pandas is not my first/favorite programming language, I thought this to be a good read:
df.groupby('group') \
.apply(lambda x: pd.Series({
'a_sum' : x['a'].sum(),
'a_max' : x['a'].max(),
'b_mean' : x['b'].mean(),
'c_d_prodsum' : (x['c'] * x['d']).sum()
})
)
a_sum a_max b_mean c_d_prodsum
group
0 0.530559 0.374540 0.553354 0.488525
1 1.433558 0.832443 0.460206 0.053313
It reminds me of dplyr pipes and data.table chained commands more than anything else. They aren’t always better, but they are more recognizable to me. (I realize the power of utilizing more formalized def functions for these types of operations, as well as the desire of many.) This is merely an alternative; it is not always superior.)
I generated data in the same way Ted did, but I’ll add a seed for consistency.
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.374540 0.950714 0.731994 0.598658 0
1 0.156019 0.155995 0.058084 0.866176 0
2 0.601115 0.708073 0.020584 0.969910 1
3 0.832443 0.212339 0.181825 0.183405 1
Answered by r2evans
Solution #5
In version 0.25.0, there is a new feature.
Pandas accepts the special syntax known as “named aggregation” in GroupBy.agg() to support column-specific aggregation with control over the output column names.
>>> animals = pd.DataFrame({
... 'kind': ['cat', 'dog', 'cat', 'dog'],
... 'height': [9.1, 6.0, 9.5, 34.0],
... 'weight': [7.9, 7.5, 9.9, 198.0]
... })
>>> print(animals)
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
>>> print(
... animals
... .groupby('kind')
... .agg(
... min_height=pd.NamedAgg(column='height', aggfunc='min'),
... max_height=pd.NamedAgg(column='height', aggfunc='max'),
... average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
... )
... )
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
pandas. A namedtuple is all that NamedAgg is. Plain tuples are also acceptable.
>>> print(
... animals
... .groupby('kind')
... .agg(
... min_height=('height', 'min'),
... max_height=('height', 'max'),
... average_weight=('weight', np.mean),
... )
... )
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
The aggregation functions do not accept any additional keyword parameters. **kwargs should only contain pairs of (column, aggfunc). If your aggregation functions require additional arguments, use functools to partially apply them. partial().
Series groupby aggregations can also use named aggregation. There is no column selection in this example, thus the values are just the functions.
>>> print(
... animals
... .groupby('kind')
... .height
... .agg(
... min_height='min',
... max_height='max',
... )
... )
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
Answered by exan
Post is based on https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns