Coder Perfect

How do I remove rows from a Pandas DataFrame if the value of a certain column is NaN?

Problem

I have this DataFrame and want only the rows whose EPS value is not NaN:

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

That is, I want something like df.drop(….) that produces this resulting DataFrame:

                 STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

How can I go about doing that?

Asked by bigbug

Solution #1

Don’t drop anything; just select the rows where EPS is not NA:

df = df[df['EPS'].notna()]
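
As a self-contained sketch of this filter, the snippet below rebuilds a stand-in for the question's data. It assumes a plain default index rather than the original (STK_ID, RPT_Date) MultiIndex, and the names mask and result are only illustrative:

```python
import numpy as np
import pandas as pd

# Stand-in for the question's data; a default integer index is assumed
# here instead of the original (STK_ID, RPT_Date) MultiIndex.
df = pd.DataFrame(
    {
        "STK_ID": ["601166", "600036", "600016", "601009", "601939", "000001"],
        "EPS": [np.nan, np.nan, 4.3, np.nan, 2.5, np.nan],
        "cash": [np.nan, 12, np.nan, np.nan, np.nan, np.nan],
    }
)

# Boolean mask: True where EPS holds a value, False where it is NaN.
mask = df["EPS"].notna()

# Indexing with the mask keeps only the rows where the mask is True.
result = df[mask]
print(result)
```

Only the two rows with an EPS value (600016 and 601939) remain; the NaN values in the cash column do not affect the result because the mask looks at EPS alone.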

Answered by eumiro

Solution #2

This question has already been resolved, but it is also worth considering the solution Wouter suggested in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Besides potentially performing better than doing it manually, these functions come with a variety of options that may be useful.

In [23]: import numpy as np; import pandas as pd

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN
In [27]: df.dropna()     #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN
In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN
In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN
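
Because the session above uses random numbers, its exact values cannot be reproduced. The following is a minimal sketch of the same four calls on a small fixed frame; the column names a, b, c and the variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Small fixed frame so the outcome is reproducible.
# Row 0: all NaN, row 1: no NaN, row 2: only "b" set, row 3: "a" and "c" set.
df = pd.DataFrame(
    {
        "a": [np.nan, 1.0, np.nan, 4.0],
        "b": [np.nan, 2.0, 3.0, np.nan],
        "c": [np.nan, 5.0, np.nan, 6.0],
    }
)

no_nan = df.dropna()               # keep rows with no NaN at all
not_empty = df.dropna(how="all")   # drop rows only if every value is NaN
at_least_2 = df.dropna(thresh=2)   # keep rows with at least 2 non-NaN values
has_b = df.dropna(subset=["b"])    # drop rows where column "b" is NaN

print(list(no_nan.index))      # [1]
print(list(not_empty.index))   # [1, 2, 3]
print(list(at_least_2.index))  # [1, 3]
print(list(has_b.index))       # [1, 2]
```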

There are other options as well, such as dropping columns instead of rows (see the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html).
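
To illustrate the column-wise variant, here is a short sketch using the axis parameter of dropna(); the frame and its column names x, y, z are invented for the example:

```python
import numpy as np
import pandas as pd

# "x" is complete, "y" has one NaN, "z" is entirely NaN.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0],
        "y": [np.nan, 5.0, 6.0],
        "z": [np.nan, np.nan, np.nan],
    }
)

# axis="columns" (or axis=1) applies the same rules to columns instead of rows.
complete_cols = df.dropna(axis="columns")              # drop columns with any NaN
non_empty_cols = df.dropna(axis="columns", how="all")  # drop only all-NaN columns

print(list(complete_cols.columns))   # ['x']
print(list(non_empty_cols.columns))  # ['x', 'y']
```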

Pretty handy!

Answered by Aman

Solution #3

I know this has already been answered, but for the sake of a purely pandas solution to this specific question, as opposed to Aman’s general description (which was wonderful), and in case anyone else happens upon this:

import pandas as pd
df = df[pd.notnull(df['EPS'])]
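
The function form pd.notnull() and the method forms used in the other answers should all yield the same mask. A quick sketch to check this, using a made-up EPS series:

```python
import numpy as np
import pandas as pd

# Made-up EPS values; None is stored as NaN in a float Series.
eps = pd.Series([np.nan, 4.3, None, 2.5], name="EPS")

mask_func_notnull = pd.notnull(eps)   # top-level function
mask_func_notna = pd.notna(eps)       # same check under the newer name
mask_method_notna = eps.notna()       # Series method
mask_method_notnull = eps.notnull()   # Series method, older name

# All four spellings produce an identical boolean Series.
print(mask_func_notnull.tolist())  # [False, True, False, True]
print(mask_func_notnull.equals(mask_method_notna))  # True
```

Which spelling to use is mostly a matter of taste.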

Answered by Kirk Hadley

Solution #4

This is something you can use:

df.dropna(subset=['EPS'], how='all', inplace=True)
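
Two details of this call may be worth spelling out. With a single-column subset, how='all' and the default how='any' select the same rows, and inplace=True modifies df itself rather than producing a new frame. A minimal sketch, with a small invented frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "EPS": [np.nan, 4.3, 2.5],
        "cash": [12.0, np.nan, np.nan],
    }
)

# Without inplace, dropna() returns a new DataFrame and leaves df untouched.
cleaned = df.dropna(subset=["EPS"])
rows_before = len(df)  # still 3 at this point

# With only one column in subset, "any" and "all" mean the same thing.
cleaned_all = df.dropna(subset=["EPS"], how="all")
print(cleaned.equals(cleaned_all))  # True

# With inplace=True, df itself loses the rows where EPS is NaN.
df.dropna(subset=["EPS"], inplace=True)
print(list(df.index))  # [1, 2]
```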

Answered by Joe

Solution #5

The simplest solution is:

filtered_df = df[df['EPS'].notnull()]

Answered by Gil Baggio

Post is based on https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-a-certain-column-is-nan