# Using pandas, count the number of unique values per group [duplicate]

## Problem

In each domain, I need to count the number of unique ID values.

I have data:

```
ID, domain
123, 'vk.com'
123, 'vk.com'
456, 'vk.com'
456, 'vk.com'
789, 'vk.com'
```

I tried `df.groupby(['domain', 'ID']).count()`, but I'd like to get something like this:

```
domain, count
vk.com   3
```
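For reproducibility, the sample above can be rebuilt as a small frame (column names and values taken from the question):

```python
import pandas as pd

# Rebuild the question's sample data
df = pd.DataFrame({
    'ID': [123, 123, 456, 456, 789],
    'domain': ["'vk.com'"] * 5,
})

print(df)
```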

## Solution #1

You need `nunique`:

```
df = df.groupby('domain')['ID'].nunique()

print(df)
domain
'vk.com'    3
Name: ID, dtype: int64
```

If you need to strip the `'` characters, do as follows:

```
df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print(df)
domain
vk.com    3
Name: ID, dtype: int64
```

Or, as Jon Clements suggested:

```
df.groupby(df.domain.str.strip("'"))['ID'].nunique()
```

You can keep the column name as follows (the output below is for a sample with several domains):

```
df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
  domain  ID
0     fb   1
1    ggl   1
2     vk   3
```

The distinction between `nunique()` and `agg()` is that `nunique()` returns a Series, while `agg()` returns a DataFrame.
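A minimal sketch of that difference, using made-up data matching the question:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [123, 123, 456, 456, 789],
    'domain': ['vk.com'] * 5,
})

s = df.groupby('domain')['ID'].nunique()                                 # -> Series
f = df.groupby('domain', as_index=False).agg({'ID': pd.Series.nunique})  # -> DataFrame

print(type(s).__name__)  # Series
print(type(f).__name__)  # DataFrame
```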

## Solution #2

In general, you can use `Series.value_counts` to count the occurrences of each distinct value in a single column:

```
df.domain.value_counts()

#'vk.com'          5
#Name: domain, dtype: int64
```

Use `Series.nunique` to count the number of unique values in a column:

```
df.domain.nunique()
# 1
```

You can use `unique` or `drop_duplicates` to obtain all of these distinct values; the only difference between the two methods is that `unique` returns a `numpy.ndarray`, while `drop_duplicates` returns a `pandas.Series`:

```
df.domain.unique()

df.domain.drop_duplicates()
#0          'vk.com'
#Name: domain, dtype: object
```

In this case, since you want to count distinct values with respect to another variable, instead of the groupby approach suggested in the other answers, you can simply drop duplicates first and then use `value_counts()`:

```
import pandas as pd
df.drop_duplicates().domain.value_counts()

# 'vk.com'          3
# Name: domain, dtype: int64
```

## Solution #3

```
>>> df.domain.value_counts()

vk.com          5

Name: domain, dtype: int64
```

## Solution #4

If I understand correctly, you want to know how many distinct IDs each domain has. Then give this a shot:

```
output = df.drop_duplicates()
output.groupby('domain').size()
```

Output:

```
domain
vk.com          3
dtype: int64
```

`value_counts` is another option, but it is somewhat less efficient. Jezrael's `nunique` answer is the fastest:

```
%timeit df.drop_duplicates().groupby('domain').size()
1000 loops, best of 3: 939 µs per loop
%timeit df.drop_duplicates().domain.value_counts()
1000 loops, best of 3: 1.1 ms per loop
%timeit df.groupby('domain')['ID'].nunique()
1000 loops, best of 3: 440 µs per loop
```