Problem
I have the following indexed DataFrame with named columns and non-contiguous row labels:
a b c d
2 0.671399 0.101208 -0.181532 0.241273
3 0.446172 -0.243316 0.051767 1.577318
5 0.614758 0.075793 -0.451460 -0.012493
I would like to add a new column, 'e', to the existing DataFrame without changing anything else in it (the new column always has the same length as the DataFrame).
0 -0.335485
1 -1.166658
2 -0.385571
dtype: float64
I’m not sure how to add column e to the above example.
Asked by tomasz74
Solution #1
Edit 2017
As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame could be using assign:
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
Edit 2015
Some people reported getting the SettingWithCopyWarning with this code. However, the code still runs correctly with the current pandas version, 0.16.1.
>>> sLength = len(df1['a'])
>>> df1
a b c d
6 -0.269221 -0.026476 0.997517 1.294385
8 0.917438 0.847941 0.034235 -0.448948
>>> df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e
6 -0.269221 -0.026476 0.997517 1.294385 1.757167
8 0.917438 0.847941 0.034235 -0.448948 2.228131
>>> pd.version.short_version
'0.16.1'
The SettingWithCopyWarning alerts you to a potentially invalid assignment on a copy of a DataFrame. It doesn't necessarily mean you did something wrong (false positives are possible), but since 0.13.0 it has told you that there are more appropriate ways to achieve the same result. So if you get the warning, just follow its advice: try using .loc[row_index, col_indexer] = value instead.
>>> df1.loc[:,'f'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e f
6 -0.269221 -0.026476 0.997517 1.294385 1.757167 -0.050927
8 0.917438 0.847941 0.034235 -0.448948 2.228131 0.006109
>>>
In fact, as described in the pandas docs, this is currently the more efficient method.
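A minimal sketch of the situation the warning is about, assuming a hypothetical frame `df` and a filtered `subset` (neither name comes from the answer). Whether the warning appears at all depends on the pandas version; taking an explicit copy and writing through .loc makes it unambiguous which object is modified.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-contiguous index, similar in shape to the question.
df = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    columns=["a", "b", "c"],
    index=[2, 3, 5, 8],
)

# Filtering yields a new object; .copy() makes that explicit, so later
# writes clearly target `subset` and not `df`.
subset = df[df["a"] > 0].copy()

# Add the column through .loc, aligning the new values on subset's own index.
subset.loc[:, "f"] = pd.Series(np.ones(len(subset)), index=subset.index)

print(subset)
print("f" in df.columns)  # the original frame is untouched
```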
Original answer:
To make the series, start with the original df1 indexes:
df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
Answered by joaquin
Solution #2
This is a quick approach to create a new column: df['e'] = e
Answered by Kathirmani Sukumar
Solution #3
I assume that the index values in e match those in df1.
The simplest way to create a new column named e and apply data from your series to it is to:
df['e'] = e.values
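To see why .values matters when the indexes do not match, here is a small sketch with made-up numbers (the frame, the series contents and the column names `aligned` and `positional` are illustrative only). Assigning the series itself aligns on labels, while assigning its values places them by position.

```python
import pandas as pd

# Hypothetical frame mirroring the question: row labels 2, 3, 5.
df = pd.DataFrame({"a": [0.1, 0.2, 0.3]}, index=[2, 3, 5])
# A series with the default 0..n-1 index, as in the question.
e = pd.Series([10.0, 20.0, 30.0])

# Label-based: only label 2 exists in both, so the other rows become NaN.
df["aligned"] = e
# Position-based: .values drops e's index, so values go in top to bottom.
df["positional"] = e.values

print(df)
```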
assign (Pandas 0.16.0+)
As of Pandas 0.16.0, you may also use assign, which adds additional columns to a DataFrame and produces a new object (a duplicate) that includes all of the original columns as well as the new ones.
df1 = df1.assign(e=e.values)
You can even include more than one column, as shown in this example (which also includes the assign function’s source code):
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.assign(mean_a=df.a.mean(), mean_b=df.b.mean())
a b mean_a mean_b
0 1 3 1.5 3.5
1 2 4 1.5 3.5
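A short check of the point that assign produces a new object rather than modifying the frame in place. The name `with_mean` is only used for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# assign returns a new DataFrame; the result has to be bound to a name
# (or reassigned to df) to be kept.
with_mean = df.assign(mean_a=df["a"].mean())

print(list(df.columns))         # original columns only
print(list(with_mean.columns))  # original columns plus mean_a
```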
In relation to your example:
np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
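mask = df1 < -0.7  # elementwise comparison, same result as the applymap call
df1 = df1[~mask.any(axis=1)]  # use ~ to negate a boolean mask; unary minus fails on booleans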
sLength = len(df1['a'])
e = pd.Series(np.random.randn(sLength))
>>> df1
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 -0.103219 0.410599 0.144044 1.454274
3 0.761038 0.121675 0.443863 0.333674
7 1.532779 1.469359 0.154947 0.378163
9 1.230291 1.202380 -0.387327 -0.302303
>>> e
0 -1.048553
1 -1.420018
2 -1.706270
3 1.950775
4 -0.509652
dtype: float64
df1 = df1.assign(e=e.values)
>>> df1
a b c d e
0 1.764052 0.400157 0.978738 2.240893 -1.048553
2 -0.103219 0.410599 0.144044 1.454274 -1.420018
3 0.761038 0.121675 0.443863 0.333674 -1.706270
7 1.532779 1.469359 0.154947 0.378163 1.950775
9 1.230291 1.202380 -0.387327 -0.302303 -0.509652
This is where you can find a description of this new feature from when it was initially launched.
Answered by Alexander
Solution #4
An ordered dict of columns is used to implement a pandas dataframe.
This means that not only can you use __getitem__ [] to get a specific column, but you can also use __setitem__ [] = to assign a new column.
The [] accessor, for example, can be used to add a column to this dataframe.
size name color
0 big rose red
1 small violet blue
2 small tulip red
3 small harebell blue
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
Note that this works even if the dataframe's index is out of order.
df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
However, if you have a pd.Series and try to assign it to a dataframe whose index does not match, you will run into trouble. Consider the following:
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
This is because a pd.Series by default has an index enumerated from 0 to n-1, and the pandas []= method tries to be “smart.”
When you use the []= method, pandas quietly aligns the right hand series with the index of the left hand dataframe, much like a left join on the index: rows whose label is missing from the series get NaN, and series labels that do not occur in the dataframe are dropped.
df['column'] = series
This quickly causes cognitive dissonance, because the []= method tries to do many different things depending on the input, and the outcome cannot be predicted unless you know how pandas works. I would therefore advise against using []= in code bases, although it is fine when exploring data in a notebook.
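The alignment can be made visible with a small sketch using made-up labels (the frame, the series and both column names are illustrative). As far as the result goes, assigning a series behaves like reindexing it to the dataframe's index first.

```python
import pandas as pd

df = pd.DataFrame({"name": ["rose", "violet", "tulip"]}, index=[3, 2, 1])
# Label 7 does not exist in df; label 3 does not exist in the series.
s = pd.Series(["x", "y", "z"], index=[1, 2, 7])

# Assignment aligns on labels: row 3 gets NaN and label 7 is dropped.
df["via_setitem"] = s
# The same result spelled out: reindex to df's index, then assign by position.
df["via_reindex"] = s.reindex(df.index).values

print(df)
```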
If you have a pd.Series that you want assigned from top to bottom, or if you are writing production code and are not sure of the index order, it is worth safeguarding against this kind of issue.
You could convert the pd.Series to a np.ndarray or a list; this will do the trick.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values
or
df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))
But this is not very explicit.
A later developer may come along and say, “Hey, this looks redundant, I’ll simply optimize this away.”
It is explicit to set the index of the pd.Series to the index of the df.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)
Or, more realistically, you probably already have a pd.Series.
protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index
3 no
2 no
1 no
0 yes
You can now assign it:
df['protected'] = protected_series
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
Since the index dissonance is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things.
df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)
df['protected'] = protected_series
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
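One detail worth spelling out: reset_index returns a new object and leaves the original untouched unless the result is assigned back. A minimal sketch, using a cut-down version of the flower frame (the variable `index_unchanged` is only there for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["rose", "violet", "tulip", "harebell"]},
    index=[3, 2, 1, 0],
)
protected_series = pd.Series(["no", "no", "no", "yes"], index=df.index)

# The result is discarded here, so df keeps its index [3, 2, 1, 0].
df.reset_index(drop=True)
index_unchanged = list(df.index) == [3, 2, 1, 0]

# Assign the results back so both objects share the default 0..n-1 index.
df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)
df["protected"] = protected_series

print(df)
```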
While df.assign makes it clearer what you’re doing, it still has the same issues as the previous []=
df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
Just make sure your column isn’t called self when using df.assign. It will result in errors. This makes df.assign smelly, since there are artifacts of this kind in the function.
df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'
“Well, I’ll just not use self then,” you might say. But who knows how this function will evolve to support more arguments in the future. Perhaps your column name will become an argument in a future pandas version, causing upgrade issues.
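If a column really has to carry such a name, one possible fallback is plain bracket assignment with a list, which places values by position and does not pass the name through keyword arguments. This is only a sketch with a made-up two-row frame; the `assign_failed` flag is illustrative, and the try/except is there so the snippet runs regardless of how a given pandas version treats the keyword.

```python
import pandas as pd

df = pd.DataFrame({"size": ["big", "small"]})
values = ["no", "yes"]

try:
    # Expected to clash with the method's own `self` parameter.
    df = df.assign(self=values)
    assign_failed = False
except TypeError:
    assign_failed = True

# Bracket assignment takes the column name as an ordinary string key.
df["self"] = values

print(assign_failed)
print(df)
```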
Answered by firelynx
Solution #5
It appears that in recent pandas versions the way to go is to use df.assign:
df1 = df1.assign(e=np.random.randn(sLength))
SettingWithCopyWarning is not generated.
Answered by Mikhail Korobov
Post is based on https://stackoverflow.com/questions/12555323/how-to-add-a-new-column-to-an-existing-dataframe