Coder Perfect

Pandas: get rows which are not in another dataframe.

Problem

I have two pandas data frames that share a few entries.

Consider the case when dataframe2 is a subset of dataframe1.

How can I retrieve the dataframe1 rows that aren’t in dataframe2?

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Expected result:

   col1  col2
3     4    13
4     5    14

Asked by think nice things

Solution #1

Several of the other solutions here yield incorrect results. To handle this problem correctly, we can do a left join from df1 to df2, taking care to first keep only the unique rows of df2.

To begin, we alter the original DataFrame to add the row containing the data [3, 10].

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
5     3    10

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Perform a left join, first dropping duplicates in df2 so that each row of df1 produces exactly one row in the result. Use the indicator argument to add an extra column showing which table each row came from.

df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
                   how='left', indicator=True)
df_all

   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only

Build a boolean mask selecting the left-only rows:

df_all['_merge'] == 'left_only'

0    False
1    False
2    False
3     True
4     True
5     True
Name: _merge, dtype: bool
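To finish the recipe, filter df_all on that mask and drop the helper _merge column; a runnable sketch using the frames from this answer:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})

# Left join with indicator, then keep only the rows unique to df1.
df_all = df1.merge(df2.drop_duplicates(), on=['col1', 'col2'],
                   how='left', indicator=True)
result = df_all[df_all['_merge'] == 'left_only'].drop(columns='_merge')
print(result)
#    col1  col2
# 3     4    13
# 4     5    14
# 5     3    10
```

Note that the duplicated row [3, 10] is correctly kept, since as a whole row it never appears in df2.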

A few solutions make the same mistake: they only check that each value appears somewhere in the corresponding column, rather than checking that both values appear together in the same row. Adding the last row, which is unique to df1 but whose values each occur in df2's columns, exposes the error:

common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

This solution produces the same incorrect result:

df1.isin(df2.to_dict('l')).all(1)
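For completeness, here is that mask evaluated on the extended frames (note that recent pandas versions expect the full orient name 'list'; 'l' was a deprecated shorthand). It wrongly flags the last row [3, 10] as present in df2:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})

# Each column is tested against df2's values independently, so the
# last row [3, 10] is incorrectly marked as present in df2.
mask = df1.isin(df2.to_dict('list')).all(1)
print(mask.tolist())
# [True, True, True, False, False, True]
```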

Answered by Ted Petrou

Solution #2

One method is to store the result of an inner merge of both dfs, then select the rows whose column values do not appear in this common frame:

In [119]:

common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
   col1  col2
0     1    10
1     2    11
2     3    12
Out[119]:
   col1  col2
3     4    13
4     5    14

EDIT

Another option, as you've discovered, is to use isin, which turns the matching rows into NaN so that you can then drop them:

In [138]:

df1[~df1.isin(df2)].dropna()
Out[138]:
   col1  col2
3     4    13
4     5    14

This will not work, however, if the rows of df2 do not align with df1 on the same index labels:

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

will return the whole of df1:

In [140]:

df1[~df1.isin(df2)].dropna()
Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

Answered by EdChum

Solution #3

Assuming that the indexes in the dataframes are consistent (and ignoring the actual col values):

df1[~df1.index.isin(df2.index)]
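A minimal runnable sketch using the frames from the question, where df2 happens to share index labels 0 through 2 with df1:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                    'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})

# Keep only the rows whose index label does not occur in df2's index.
result = df1[~df1.index.isin(df2.index)]
print(result)
#    col1  col2
# 3     4    13
# 4     5    14
```

Keep in mind this compares index labels only, not row contents, so it is only valid when the two frames share a consistent index.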

Answered by Dennis Golomazov

Solution #4

As stated above, isin requires matching columns and indices. If the match should be on row contents only, one technique for obtaining the mask that filters out the rows present in both frames is to convert the rows to a (Multi)Index:

In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
   col1  col2
1     2    11
4     5    14
5     3    10

If an index should be retained, the set_index keyword argument append can be used to append columns to an existing index. If the columns do not line up, explicit column specifications can be used instead of list(df.columns).

pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())

may be utilized to make the indices instead, though I doubt it is more efficient.
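A sketch of that alternative on the same frames, to confirm it yields the same selection (no claims about its relative efficiency):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 3, 4],
                    'col2': [10, 12, 13]})

# Build a MultiIndex of row tuples for each frame and compare them.
idx1 = pd.MultiIndex.from_tuples(df1.to_records(index=False).tolist())
idx2 = pd.MultiIndex.from_tuples(df2.to_records(index=False).tolist())
result = df1.loc[~idx1.isin(idx2)]
print(result)
#    col1  col2
# 1     2    11
# 4     5    14
# 5     3    10
```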

Answered by Rune Lyngsoe

Solution #5

If you have two dataframes, df_1 and df_2, both of which have several fields (column names), and you want to find only those entries in df_1 that are not in df_2 based on specific fields (e.g. field_x and field_y), then follow the steps below.

Step 1: Add a key1 column to df_1 and a key2 column to df_2.

Step 2: Left-join the dataframes on the desired columns, field_x and field_y.

Step 3: Keep only the rows of the merged frame where key2 does not equal key1 (after a left join, key2 is NaN on exactly the unmatched rows).

Step 4: Drop the key1 and key2 columns.

This strategy will fix your problem and, even with large data sets, will be quick. I’ve tried it with dataframes with over a million rows.

df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
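Applied to sample frames shaped like the ones in the question (with col1/col2 renamed to field_x/field_y), the recipe runs as follows:

```python
import pandas as pd

df_1 = pd.DataFrame({'field_x': [1, 2, 3, 4, 5],
                     'field_y': [10, 11, 12, 13, 14]})
df_2 = pd.DataFrame({'field_x': [1, 2, 3],
                     'field_y': [10, 11, 12]})

# Step 1: marker columns; after the left join, key2 is NaN on the
# rows of df_1 that found no match in df_2.
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='left')
# Step 3: NaN != 1, so the comparison is False exactly on unmatched rows.
df_1 = df_1[~(df_1.key2 == df_1.key1)]
# Step 4: drop the helper columns.
df_1 = df_1.drop(['key1', 'key2'], axis=1)
print(df_1)
#    field_x  field_y
# 3        4       13
# 4        5       14
```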

Answered by Pragalbh kulshrestha

Post is based on https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe