Problem
I have two pandas DataFrames that share some rows. Consider the case where df2 is a subset of df1. How can I retrieve the rows of df1 that are not in df2?
import pandas

df1 = pandas.DataFrame(data={'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data={'col1': [1, 2, 3], 'col2': [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
df2
col1 col2
0 1 10
1 2 11
2 3 12
Expected result:
col1 col2
3 4 13
4 5 14
Asked by think nice things
Solution #1
The currently accepted solution produces incorrect results. To handle this problem correctly, perform a left join from df1 to df2, taking care to keep only the unique rows of df2 first.
To begin, change the original DataFrame so that it includes the row [3, 10]:
import pandas as pd

df1 = pd.DataFrame(data={'col1': [1, 2, 3, 4, 5, 3],
                         'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data={'col1': [1, 2, 3],
                         'col2': [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 3 10
df2
col1 col2
0 1 10
1 2 11
2 3 12
Perform a left join, dropping duplicates in df2 so that each row of df1 joins with exactly one row of df2. Use the indicator argument to add an extra column showing which table each row came from.
df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
df_all
col1 col2 _merge
0 1 10 both
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 3 10 left_only
Create a boolean mask selecting the rows that appear only in df1:
df_all['_merge'] == 'left_only'
0 False
1 False
2 False
3 True
4 True
5 True
Name: _merge, dtype: bool
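To complete the anti-join, select the rows flagged left_only and drop the helper column (a minimal sketch following on from the steps above):
result = df_all[df_all['_merge'] == 'left_only'].drop(columns=['_merge'])
result
col1 col2
3 4 13
4 5 14
5 3 10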
Several solutions make the same mistake: they only check that each value appears somewhere in each column independently, rather than together in the same row. Adding the last row, which is unique but contains values from both df2 columns, exposes the mistake:
common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
This solution produces the same incorrect result:
df1.isin(df2.to_dict('l')).all(1)
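To see the failure concretely, here is a short check using the modified df1/df2 from above (note that recent pandas versions require the full orient name 'list' instead of the abbreviation 'l'):
mask = df1.isin(df2.to_dict('list')).all(1)
# Row 5 ([3, 10]) is marked True because 3 appears somewhere in df2.col1 and
# 10 appears somewhere in df2.col2, even though the pair never occurs together
# as a single row of df2 -- so filtering with ~mask wrongly drops row 5.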
Answered by Ted Petrou
Solution #2
One method is to store the result of an inner merge of both frames, then select the rows of df1 whose column values are not in this common set:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
EDIT
Another option, as you’ve discovered, is to use isin, which lets you turn the rows already present in df2 into NaN and then drop them:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
This will not work if the rows of df2 do not line up with df1 on the index, because isin compares cells element-wise after aligning on index and columns. For example, with
df2 = pd.DataFrame(data={'col1': [2, 3, 4], 'col2': [11, 12, 13]})
the same expression returns the whole of df1:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
Answered by EdChum
Solution #3
Assuming the indexes are consistent between the two dataframes (ignoring the actual column values):
df1[~df1.index.isin(df2.index)]
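A quick check with the question's original data, where df2's index (0-2) is a subset of df1's index (0-4), returns the expected rows (a minimal sketch, not part of the original answer):
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})
df1[~df1.index.isin(df2.index)]
col1 col2
3 4 13
4 5 14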
Answered by Dennis Golomazov
Solution #4
As stated previously, isin requires matching columns and indices. If the match should be on row contents only, one way to obtain the mask for filtering is to convert the rows to a (Multi)Index:
In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
col1 col2
1 2 11
4 5 14
5 3 10
If the index should be taken into account, set_index has a keyword argument append for appending the columns to the existing index. If the columns do not line up, list(df.columns) can be replaced with explicit column specifications to align the data.
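For example, to require both the original index and the row contents to match, one might write (a sketch under that assumption, not from the original answer):
# append=True keeps the existing index and adds the columns as extra levels,
# so two rows only match when their index and their contents both agree
idx1 = df1.set_index(list(df1.columns), append=True).index
idx2 = df2.set_index(list(df2.columns), append=True).index
df1.loc[~idx1.isin(idx2)]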
Alternatively, pandas.MultiIndex.from_tuples(df<N>.to_records(index=False).tolist()) could be used to create the indices, though I doubt this is more efficient.
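Spelled out for both frames (df<N> above stands for each dataframe in turn), that alternative would look roughly like this:
idx1 = pandas.MultiIndex.from_tuples(df1.to_records(index=False).tolist())
idx2 = pandas.MultiIndex.from_tuples(df2.to_records(index=False).tolist())
df1.loc[~idx1.isin(idx2)]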
Answered by Rune Lyngsoe
Solution #5
If you have two dataframes, df_1 and df_2, both with several fields (columns), and you want to identify only those rows of df_1 that are not in df_2 based on specific fields (e.g. field_x and field_y), follow the steps below.
Step 1: Add a key1 column to df_1 and a key2 column to df_2, each set to 1.
Step 2: Left-merge the two dataframes on the desired columns, field_x and field_y.
Step 3: From the merged result, keep only the rows where key1 is not equal to key2.
Step 4: Drop the key1 and key2 columns.
This strategy solves the problem and is fast even with large data sets; I have tried it with dataframes of over a million rows.
df_1['key1'] = 1
df_2['key2'] = 1
# Left merge: rows of df_1 with no match in df_2 end up with key2 = NaN
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='left')
# Keep only the unmatched rows, then drop the helper key columns
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1', 'key2'], axis=1)
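For illustration, running these steps on the question's data (a sketch that maps the generic field_x/field_y names onto col1/col2):
import pandas as pd

df_1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df_2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['col1', 'col2'], how='left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1.drop(['key1', 'key2'], axis=1)
col1 col2
3 4 13
4 5 14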
Answered by Pragalbh kulshrestha
Post is based on https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe