Problem
In each domain, I need to count the number of unique ID values.
I have data:
ID, domain
123, 'vk.com'
123, 'vk.com'
123, 'twitter.com'
456, 'vk.com'
456, 'facebook.com'
456, 'vk.com'
456, 'google.com'
789, 'twitter.com'
789, 'vk.com'
I tried df.groupby(['domain', 'ID']).count()
But I'd like to get something like this:
domain, count
vk.com 3
twitter.com 2
facebook.com 1
google.com 1
Asked by Arseniy Krupenin
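For anyone who wants to reproduce the answers below, the sample data can be built as a DataFrame like so (a minimal sketch; the domain values keep their surrounding single quotes, exactly as in the question):

```python
import pandas as pd

# Reproduce the sample data from the question. The quotes are part
# of the string values, which is why some answers strip them later.
df = pd.DataFrame({
    'ID': [123, 123, 123, 456, 456, 456, 456, 789, 789],
    'domain': ["'vk.com'", "'vk.com'", "'twitter.com'", "'vk.com'",
               "'facebook.com'", "'vk.com'", "'google.com'",
               "'twitter.com'", "'vk.com'"],
})
```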
Solution #1
You need nunique:
df = df.groupby('domain')['ID'].nunique()
print (df)
domain
'facebook.com' 1
'google.com' 1
'twitter.com' 2
'vk.com' 3
Name: ID, dtype: int64
If you need to remove the ' characters, do as follows:
df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print (df)
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
Name: ID, dtype: int64
Or, as Jon Clements suggested:
df.groupby(df.domain.str.strip("'"))['ID'].nunique()
You can keep the column name as follows:
df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
domain ID
0 fb 1
1 ggl 1
2 twitter 2
3 vk 3
The distinction between nunique() and agg() is that nunique() returns a Series, while agg() returns a DataFrame.
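A quick sketch (with made-up, already-unquoted data) illustrating that return-type difference:

```python
import pandas as pd

df = pd.DataFrame({'ID': [123, 123, 456, 456, 789],
                   'domain': ['vk.com', 'twitter.com', 'vk.com',
                              'google.com', 'vk.com']})

# nunique on a grouped column returns a Series indexed by domain
s = df.groupby('domain')['ID'].nunique()

# agg with as_index=False keeps domain as a regular column in a DataFrame
out = df.groupby('domain', as_index=False).agg({'ID': pd.Series.nunique})

print(type(s).__name__)    # Series
print(type(out).__name__)  # DataFrame
```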
Answered by jezrael
Solution #2
In general, to count the occurrences of each distinct value in a single column, you can use Series.value_counts:
df.domain.value_counts()
#'vk.com' 5
#'twitter.com' 2
#'facebook.com' 1
#'google.com' 1
#Name: domain, dtype: int64
Use Series.nunique to count the number of unique values in a column:
df.domain.nunique()
# 4
You can use unique or drop_duplicates to acquire all of these distinct values; the only difference between the two is that unique returns a numpy.array, while drop_duplicates returns a pandas.Series:
df.domain.unique()
# array(["'vk.com'", "'twitter.com'", "'facebook.com'", "'google.com'"], dtype=object)
df.domain.drop_duplicates()
#0 'vk.com'
#2 'twitter.com'
#4 'facebook.com'
#6 'google.com'
#Name: domain, dtype: object
In this case, as you want to count distinct values in relation to another variable, instead of using the groupby approach suggested by the other answers, you may simply drop duplicates first and then use value_counts():
import pandas as pd
df.drop_duplicates().domain.value_counts()
# 'vk.com' 3
# 'twitter.com' 2
# 'facebook.com' 1
# 'google.com' 1
# Name: domain, dtype: int64
Answered by Psidom
Solution #3
>>> df.domain.value_counts()
vk.com 5
twitter.com 2
google.com 1
facebook.com 1
Name: domain, dtype: int64
Answered by kamran kausar
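Note that value_counts on its own counts rows per domain (vk.com appears 5 times in the data), not unique IDs; to count unique IDs you still need to drop duplicate (ID, domain) pairs first, as in Solution #2. A minimal sketch of the difference, using the sample data with the quotes already stripped:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [123, 123, 123, 456, 456, 456, 456, 789, 789],
    'domain': ['vk.com', 'vk.com', 'twitter.com', 'vk.com',
               'facebook.com', 'vk.com', 'google.com',
               'twitter.com', 'vk.com'],
})

print(df.domain.value_counts()['vk.com'])                    # 5: counts rows
print(df.drop_duplicates().domain.value_counts()['vk.com'])  # 3: counts unique IDs
```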
Solution #4
If I understand correctly, you want to know how many distinct IDs each domain has. Then give this a shot:
output = df.drop_duplicates()
output.groupby('domain').size()
Output:
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
dtype: int64
value_counts is another option, though it is slightly less efficient. Jezrael's nunique answer is still the fastest:
%timeit df.drop_duplicates().groupby('domain').size()
1000 loops, best of 3: 939 µs per loop
%timeit df.drop_duplicates().domain.value_counts()
1000 loops, best of 3: 1.1 ms per loop
%timeit df.groupby('domain')['ID'].nunique()
1000 loops, best of 3: 440 µs per loop
Answered by ysearka
Post is based on https://stackoverflow.com/questions/38309729/count-unique-values-per-groups-with-pandas