Coder Perfect

How do I add a new column to a DataFrame that already exists?

Problem

I have the following indexed DataFrame with named columns and non-contiguous row numbers:

          a         b         c         d
2  0.671399  0.101208 -0.181532  0.241273
3  0.446172 -0.243316  0.051767  1.577318
5  0.614758  0.075793 -0.451460 -0.012493

I would like to add a new column, ‘e’, to the existing data frame and do not want to change anything in the data frame (i.e., the new column always has the same length as the DataFrame).

0   -0.335485
1   -1.166658
2   -0.385571
dtype: float64

I’m not sure how to add column e to the above example.

Asked by tomasz74

Solution #1

Edit 2017

As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame could be using assign:

df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
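The one-liner above relies on df1 and sLength already being defined. A minimal self-contained sketch, assuming a small random frame with the same non-contiguous index as in the question (the frame contents are made up for illustration), might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's frame: index 2, 3, 5.
df1 = pd.DataFrame(np.random.randn(3, 4), columns=list("abcd"), index=[2, 3, 5])
sLength = len(df1)

# .values drops the Series' own 0..n-1 index, so the numbers are
# attached by position and every row receives a value.
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
print(df1)
```

Because assign returns a new DataFrame, the result is bound back to df1.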

Edit 2015

Some people have reported getting a SettingWithCopyWarning with this code. However, the code still runs correctly with the current pandas version, 0.16.1.

>>> sLength = len(df1['a'])
>>> df1
          a         b         c         d
6 -0.269221 -0.026476  0.997517  1.294385
8  0.917438  0.847941  0.034235 -0.448948

>>> df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e
6 -0.269221 -0.026476  0.997517  1.294385  1.757167
8  0.917438  0.847941  0.034235 -0.448948  2.228131

>>> pd.version.short_version
'0.16.1'

The SettingWithCopyWarning warns about a possibly invalid assignment on a copy of a DataFrame. It doesn't necessarily mean you did something wrong (it can trigger false positives), but since 0.13.0 it has let you know that there are more suitable methods for the same purpose. If you get the warning, just follow its advice: try using .loc[row_index, col_indexer] = value instead.

>>> df1.loc[:,'f'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e         f
6 -0.269221 -0.026476  0.997517  1.294385  1.757167 -0.050927
8  0.917438  0.847941  0.034235 -0.448948  2.228131  0.006109

In fact, as described in the pandas docs, this is currently the more efficient method.
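The warning typically shows up when the frame being written to was itself produced by filtering another frame. One common way to make the intent explicit is to take a copy of the subset before adding the column with .loc. The names df, subset and values below are illustrative and not from the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])

# Filtering returns a new object; the explicit .copy() states that
# later writes are meant for the subset only, not for df.
subset = df[df["a"] > 0].copy()

# Build the new column on the subset's own index, then add it with .loc.
values = pd.Series(np.random.randn(len(subset)), index=subset.index)
subset.loc[:, "f"] = values

print(subset)
print("f" in df.columns)  # the parent frame is left alone
```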

Original answer:

Use the original df1 index to create the series:

df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)

Answered by joaquin

Solution #2

This is a simple way to add a new column:

df['e'] = e
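Note that plain item assignment matches a Series to the frame by index label, not by position. A small sketch using the index and the e values from the question (only column a is reproduced, to keep it short) suggests what that means here:

```python
import pandas as pd

# Column 'a' and the index from the question.
df1 = pd.DataFrame({"a": [0.671399, 0.446172, 0.614758]}, index=[2, 3, 5])

# The question's series, which carries the default index 0, 1, 2.
e = pd.Series([-0.335485, -1.166658, -0.385571])

aligned = df1.copy()
aligned["e"] = e  # matched by label: only label 2 exists on both sides

by_position = df1.copy()
by_position["e"] = e.values  # index stripped: matched by position

print(aligned)
print(by_position)
```

In the first frame only row 2 receives a number and the other rows hold NaN, so the short form is safe only when the two indexes agree.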

Answered by Kathirmani Sukumar

Solution #3

I assume that the index values in e match those in df1.

The simplest way to create a new column named e and fill it with the values from your series e is:

df['e'] = e.values

assign (Pandas 0.16.0+)

As of Pandas 0.16.0, you can also use assign, which assigns new columns to a DataFrame and returns a new object (a copy) with all the original columns in addition to the new ones.

df1 = df1.assign(e=e.values)
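The reassignment to df1 matters because assign leaves the frame it is called on untouched. A short sketch with made-up data (df, e and result are illustrative names) shows the difference:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[10, 20])
e = pd.Series([0.5, 1.5])

# assign builds and returns a new DataFrame.
result = df.assign(e=e.values)

print(list(df.columns))      # the original frame
print(list(result.columns))  # the returned frame
```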

You can even include more than one column, as shown in this example (which also includes the assign function’s source code):

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.assign(mean_a=df.a.mean(), mean_b=df.b.mean())
   a  b  mean_a  mean_b
0  1  3     1.5     3.5
1  2  4     1.5     3.5

In relation to your example:

np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
mask = df1 < -0.7  # True where a value is below -0.7
df1 = df1[~mask.any(axis=1)]  # keep rows with no such value; ~ negates a boolean mask
sLength = len(df1['a'])
e = pd.Series(np.random.randn(sLength))

>>> df1
          a         b         c         d
0  1.764052  0.400157  0.978738  2.240893
2 -0.103219  0.410599  0.144044  1.454274
3  0.761038  0.121675  0.443863  0.333674
7  1.532779  1.469359  0.154947  0.378163
9  1.230291  1.202380 -0.387327 -0.302303

>>> e
0   -1.048553
1   -1.420018
2   -1.706270
3    1.950775
4   -0.509652
dtype: float64

df1 = df1.assign(e=e.values)

>>> df1
          a         b         c         d         e
0  1.764052  0.400157  0.978738  2.240893 -1.048553
2 -0.103219  0.410599  0.144044  1.454274 -1.420018
3  0.761038  0.121675  0.443863  0.333674 -1.706270
7  1.532779  1.469359  0.154947  0.378163  1.950775
9  1.230291  1.202380 -0.387327 -0.302303 -0.509652

The pandas 0.16.0 release notes describe this feature as it was first introduced.

Answered by Alexander

Solution #4

An ordered dict of columns is used to implement a pandas dataframe.

This means that not only can you use __getitem__ [] to get a specific column, but you can also use __setitem__ [] = to assign a new column.

The [] accessor, for example, can be used to add a column to this dataframe.

    size      name color
0    big      rose   red
1  small    violet  blue
2  small     tulip   red
3  small  harebell  blue

df['protected'] = ['no', 'no', 'no', 'yes']

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note that this works even if the dataframe's index is not the default 0, 1, 2, ... range.

df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

However, if you have a pd.Series and try to assign it to a dataframe whose index does not match the series' index, you will run into trouble. Consider the following scenario:

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

This is because a pd.Series by default has an index enumerated from 0 to n-1. The pandas [] = method, on the other hand, tries to be “smart.”

When you use the [] = method, pandas quietly aligns the right-hand series to the index of the left-hand dataframe, in effect a left join on the index: values are matched by label, and dataframe rows with no matching label get NaN.

df['column'] = series
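One way to see the alignment is to compare the assigned column with the same series reindexed to the dataframe's index. This is a sketch that rebuilds the flower dataframe from above; the variable names series and expected are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "size": ["big", "small", "small", "small"],
        "name": ["rose", "violet", "tulip", "harebell"],
        "color": ["red", "blue", "red", "blue"],
    },
    index=[3, 2, 1, 0],
)
series = pd.Series(["no", "no", "no", "yes"])  # default index 0, 1, 2, 3

# What label-based alignment amounts to: look up each df label in the series.
expected = series.reindex(df.index)

df["protected"] = series

print(df)
print(df["protected"].tolist() == expected.tolist())
```

The "yes" stored under label 3 in the series ends up in the dataframe row labelled 3, which is the first row.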

Because the []= method tries to do many different things depending on the input, and the outcome cannot be predicted unless you know how pandas works, this quickly causes cognitive dissonance. I would therefore advise against using []= in code bases, although it is fine when exploring data in a notebook.

If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and are not sure of the index order, it is worth safeguarding against this kind of issue.

This will work if you downcast the pd.Series to a np.ndarray or a list.

df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values

or

df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))

But this is not very explicit.

Some developer may come along and say, “Hey, this looks redundant, I’ll simply optimize this away.”

It is explicit to set the index of the pd.Series to the index of the df.

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)

Or, more realistically, you probably already have a pd.Series.

protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index

3     no
2     no
1     no
0    yes

Now you can assign it:

df['protected'] = protected_series

    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

Since the index dissonance is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things.

# reset_index returns a new object, so the result has to be bound back
df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)
df['protected'] = protected_series

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

While df.assign makes it more explicit what you are doing, it has the same problems as the []= above:

df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

Just make sure your column isn’t called self when using df.assign. It will cause errors. This makes df.assign smelly, since there are these kinds of artifacts in the function.

df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'

“Well, I’ll just not use self then,” you might say. But who knows how this function will evolve to support more arguments in the future. Perhaps your column name will become an argument in a future pandas version, causing upgrade issues.
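If a column really does need such a name, item assignment is one way around the clash, since the name is passed as a string key and not as a keyword argument. The sketch below assumes assign still declares self as an ordinary parameter, as in the error quoted above; the data and the flag assign_failed are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"name": ["rose", "violet", "tulip", "harebell"]})
values = ["no", "no", "no", "yes"]

try:
    df.assign(self=values)
    assign_failed = False
except TypeError as exc:
    # The keyword collides with the method's own first parameter.
    assign_failed = True
    print(exc)

# Item assignment takes the column name as a plain string key,
# so parameter names of the method play no role here.
df["self"] = values
print(df)
```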

Answered by firelynx

Solution #5

In recent pandas versions, df.assign appears to be the way to go:

df1 = df1.assign(e=np.random.randn(sLength))

SettingWithCopyWarning is not generated.

Answered by Mikhail Korobov

Post is based on https://stackoverflow.com/questions/12555323/how-to-add-a-new-column-to-an-existing-dataframe