Coder Perfect

Create a pandas DataFrame by appending one row at a time.

Problem

I realize that Pandas is designed to load a completely populated DataFrame, but I need to start with an empty DataFrame and then fill it in one row at a time. What is the most effective method for accomplishing this?

I was able to successfully build an empty DataFrame with the following:

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill a field with:

res = res.set_value(len(res), 'qty1', 10.0)

It works, but it seems a little strange :-/ (and it fails when trying to add a string value).

How can I add a new row (with a different column type) to my DataFrame?

Asked by PhE

Solution #1

You can use df.loc[i], where i is the index of the row in the DataFrame you are assigning.

>>> import pandas as pd
>>> from numpy.random import randint

>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))

>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6
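Since the original question asks about mixed column types, the same `df.loc` assignment also works when a row mixes strings and floats. A minimal sketch using the question's column names (the row values are hypothetical):

```python
import pandas as pd

# column names taken from the question; values are made up for illustration
df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])

# each assigned row can mix a string with numeric values
df.loc[len(df)] = ['apples', 10.0, 20.0]
df.loc[len(df)] = ['pears', 3.0, 4.0]

print(df.shape)  # (2, 3)
```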

Answered by fred

Solution #2

If you can collect all of the data for the DataFrame ahead of time, there is a significantly faster technique than appending to the DataFrame row by row: accumulate the rows in a list of dictionaries and build the DataFrame once at the end.

I had a similar scenario where appending rows to a DataFrame one at a time took 30 minutes, while constructing the DataFrame from a list of dictionaries took seconds.

rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(blah..)

    rows_list.append(dict1)

df = pd.DataFrame(rows_list)
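The `blah..` above is a placeholder for however the row dictionary actually gets filled. A self-contained sketch of the same pattern, with a hypothetical `input_rows` of tuples standing in for the real data source:

```python
import pandas as pd

# hypothetical input: (name, qty1, qty2) tuples
input_rows = [('name0', 1, 2), ('name1', 3, 4), ('name2', 5, 6)]

rows_list = []
for name, q1, q2 in input_rows:
    # key = column name, value = cell value
    rows_list.append({'lib': name, 'qty1': q1, 'qty2': q2})

# one DataFrame construction at the end, instead of growing row by row
df = pd.DataFrame(rows_list)
print(df.shape)  # (3, 3)
```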

Answered by ShikharDua

Solution #3

I was concerned about performance when adding a large number of rows to a DataFrame, so I tried the four most popular methods and timed them.

The runtime results (in seconds) were shown as an image in the original answer; the dictionary-based construction was the fastest.

As a result, I do my own additions via the dictionary approach.

Code:

import pandas as pd
import numpy as np
import time

numOfRows = 1000

# DataFrame.append (deprecated in pandas 1.4, removed in 2.0)
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows - 4):
    df1 = df1.append(dict((a, np.random.randint(100)) for a in ['A', 'B', 'C', 'D', 'E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc without preallocation
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows):
    df2.loc[i] = np.random.randint(100, size=(1, 5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with preallocation
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'])
startTime = time.perf_counter()
for i in range(1, numOfRows):
    df3.loc[i] = np.random.randint(100, size=(1, 5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# list of dicts, DataFrame built once at the end
startTime = time.perf_counter()
row_list = []
for i in range(0, 5):
    row_list.append(dict((a, np.random.randint(100)) for a in ['A', 'B', 'C', 'D', 'E']))
for i in range(1, numOfRows - 4):
    dict1 = dict((a, np.random.randint(100)) for a in ['A', 'B', 'C', 'D', 'E'])
    row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A', 'B', 'C', 'D', 'E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)

P.S.: I believe my implementation isn’t perfect, and there may be optimizations still to be made.

Answered by Mikhail_Sam

Solution #4

You can use pandas.concat() to do this. DataFrame.append() also worked historically, but it was deprecated in pandas 1.4 and removed in pandas 2.0. See Merge, join, and concatenate for more information and examples.
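For example, a single row can be appended with pd.concat by wrapping it in a one-row DataFrame (a sketch with hypothetical column names and values):

```python
import pandas as pd

df = pd.DataFrame({'lib': ['name0'], 'qty1': [1], 'qty2': [2]})

# wrap the new row in a one-row DataFrame, then concatenate
new_row = pd.DataFrame([{'lib': 'name1', 'qty1': 3, 'qty2': 4}])
df = pd.concat([df, new_row], ignore_index=True)
print(df.shape)  # (2, 3)
```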

Answered by NPE

Solution #5

Yes, it has been stated that you should never grow a DataFrame, but rather add your data to a list and convert it to a DataFrame at the end. But do you realize why?

Here are the most important reasons, taken from my post here.

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

The quickest way to see how much these methods differ in performance is to time them.

For reference, here is the benchmarking code.
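Not the author's original benchmark, but a minimal sketch of such a timing comparison, growing a DataFrame with `.loc` versus accumulating a plain list first (the row count and column names are arbitrary):

```python
import time
import pandas as pd

n = 500

# grow the DataFrame one row at a time with .loc
start = time.perf_counter()
df_grow = pd.DataFrame(columns=['A', 'B', 'C'])
for i in range(n):
    df_grow.loc[i] = [i, i * 2, i * 3]
grow_time = time.perf_counter() - start

# accumulate rows in a plain list, build the DataFrame once
start = time.perf_counter()
data = []
for i in range(n):
    data.append([i, i * 2, i * 3])
df_list = pd.DataFrame(data, columns=['A', 'B', 'C'])
list_time = time.perf_counter() - start

print(df_grow.shape, df_list.shape)  # both (500, 3)
print('speedup: {:.0f}x'.format(grow_time / max(list_time, 1e-9)))
```

The exact speedup is machine-dependent, but the list-then-construct version is consistently faster, and the gap widens as the row count grows.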

Posts like this remind me why I joined this community in the first place: people here recognize the importance of teaching how to get the right answer with the right code, not just the right answer with the wrong code. You might argue that using loc or append isn’t a problem if you’re only adding a single row to your DataFrame. However, people frequently come to this question wanting to add more than one row; often the requirement is to iteratively add rows inside a loop using data from a function (see the related question). In that case, it is critical to recognize that iteratively growing a DataFrame is not a good idea.

Answered by cs95

Post is based on https://stackoverflow.com/questions/10715965/create-a-pandas-dataframe-by-appending-one-row-at-a-time