Problem
I have two pandas data frames that share a few entries.
Consider the case when dataframe2 is a subset of dataframe1.
How can I retrieve the dataframe1 rows that aren’t in dataframe2?
import pandas
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
df2
col1 col2
0 1 10
1 2 11
2 3 12
Expected result:
col1 col2
3 4 13
4 5 14
Asked by think nice things
Solution #1
The currently accepted solution produces incorrect results. To handle this problem properly, we can perform a left-join from df1 to df2, taking care to drop the duplicate rows from df2 first.
To begin, we alter the original DataFrame to add the row with values [3, 10].
import pandas as pd
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
                           'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 3 10
df2
col1 col2
0 1 10
1 2 11
2 3 12
Perform a left-join, first dropping the duplicates from df2 so that each row of df1 joins with exactly one row of df2. Use the indicator argument to add an extra column showing which table each row came from.
df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
df_all
col1 col2 _merge
0 1 10 both
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 3 10 left_only
Build a boolean mask selecting the rows that came only from df1:
df_all['_merge'] == 'left_only'
0 False
1 False
2 False
3 True
4 True
5 True
Name: _merge, dtype: bool
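Applying this mask then yields the rows of df1 that are not in df2; a minimal sketch that also drops the helper _merge column:
df1_only = df_all[df_all['_merge'] == 'left_only'].drop(columns=['_merge'])
df1_only
col1 col2
3 4 13
4 5 14
5 3 10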
A few solutions make the same mistake: they only check that each value appears somewhere in each column independently, rather than together in the same row. Adding the last row, which is unique but contains values drawn from both columns of df2, exposes the mistake:
common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
This solution produces the same incorrect result:
df1.isin(df2.to_dict('list')).all(1)
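With the modified df1, this evaluates to the following, again wrongly marking row 5 as present in df2:
0 True
1 True
2 True
3 False
4 False
5 True
dtype: bool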
Answered by Ted Petrou
Solution #2
One way is to save the result of an inner merge of both dfs, and then pick the rows whose column values are not in this common set:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
EDIT
Another option, as you’ve discovered, is to use isin to turn the matching rows into NaN, which you can then drop:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
This will not work if the rows of df2 are not aligned with those of df1, because isin matches on index labels as well as values. For example, redefining df2 as
df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
returns the whole of df1:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
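An alignment-independent variant (not part of the original answer; a sketch that compares whole rows as tuples) avoids this pitfall:
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
col1 col2
0 1 10
4 5 14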
Answered by EdChum
Solution #3
Assuming that the indexes in the dataframes are consistent (and ignoring the actual col values):
df1[~df1.index.isin(df2.index)]
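For instance, with the question’s original df1 and df2 (indices 0-4 and 0-2), this selects:
col1 col2
3 4 13
4 5 14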
Answered by Dennis Golomazov
Solution #4
As stated previously, isin requires the columns and indices to be the same for a match. If the match should only be on the row contents, converting the rows to a (Multi)Index is one way to obtain the mask for filtering:
In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
col1 col2
1 2 11
4 5 14
5 3 10
If an index should be preserved, the set_index keyword argument append can be used to append columns to the existing index. If the columns do not line up, explicit column specifications can be used instead of list(df.columns).
pandas.MultiIndex.from_tuples(df<N>.to_records(index=False).tolist())
could be used to create the indices instead, though I doubt this is more efficient.
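A minimal runnable sketch of that alternative, using the df1 and df2 defined in In [77] and In [78] above:
left = pandas.MultiIndex.from_tuples(df1.to_records(index=False).tolist())
right = pandas.MultiIndex.from_tuples(df2.to_records(index=False).tolist())
df1.loc[~left.isin(right)]
col1 col2
1 2 11
4 5 14
5 3 10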
Answered by Rune Lyngsoe
Solution #5
If you have two dataframes, df_1 and df_2, both with several fields (column names), and you want to find only those entries in df_1 that are not in df_2 based on certain fields (e.g. field_x and field_y), then follow the steps below.
Step 1: Add a key1 column to df_1 and a key2 column to df_2.
Step 2: Merge the dataframes as shown below, joining on the desired columns field_x and field_y.
Step 3: From df_1, select only the rows where key1 is not equal to key2.
Step 4: Drop key1 and key2.
This approach solves the problem and is fast even with large data sets. I have tried it with dataframes of over a million rows.
df_1['key1'] = 1
df_2['key2'] = 1
# Left-join: key2 stays NaN for the rows of df_1 with no match in df_2
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='left')
# Keep only the rows where key2 was not filled in by the join
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1', 'key2'], axis=1)
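A quick check against the question’s data (a sketch, substituting col1/col2 for the placeholder field_x/field_y):
df_1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df_2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['col1', 'col2'], how='left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1', 'key2'], axis=1)
df_1
col1 col2
3 4 13
4 5 14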
Answered by Pragalbh kulshrestha
Post is based on https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe