Coder Perfect

Create a Pandas Dataframe by appending one row at a time

Problem

I realize that Pandas is designed to load a completely populated DataFrame, but I need to start with an empty DataFrame and then fill it in one row at a time. What is the most effective method for accomplishing this?

I was able to successfully build an empty DataFrame with the following:

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill in a single field with:

res = res.set_value(len(res), 'qty1', 10.0)

It works, but it seems a little odd :-/ (and it fails when I try to add a string value).

How can I add a new row (with a different column type) to my DataFrame?

Asked by PhE

Solution #1

You can use df.loc[i], where i is the row label you assign; if that label does not yet exist, the row is added to the DataFrame (setting with enlargement).

>>> import pandas as pd
>>> from numpy.random import randint

>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))

>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6

Answered by fred

Solution #2

If you can collect all of the data ahead of time, there is a significantly faster approach than appending to the DataFrame row by row: build a list of dictionaries and construct the DataFrame once at the end.

I had a similar scenario where adding rows to a data frame took 30 minutes while constructing a data frame from a list of dictionaries took seconds.

rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(blah..)

    rows_list.append(dict1)

df = pd.DataFrame(rows_list)
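As a runnable sketch of this pattern (the `input_rows` data and the column names here are hypothetical, chosen to match the question):

```python
import pandas as pd

# Hypothetical input: each row arrives as a (name, qty1, qty2) tuple.
input_rows = [('lib_a', 10.0, 1), ('lib_b', 20.0, 2), ('lib_c', 30.0, 3)]

rows_list = []
for name, qty1, qty2 in input_rows:
    # One dict per row: key = column name, value = cell value.
    rows_list.append({'lib': name, 'qty1': qty1, 'qty2': qty2})

# A single DataFrame construction at the end, instead of repeated appends.
df = pd.DataFrame(rows_list)
print(df)
```

Note that mixed column types (strings and numbers) are no problem here, since each column's dtype is inferred once from the complete data.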

Answered by ShikharDua

Solution #3

I was concerned about performance when adding a large number of rows to a DataFrame, so I put the four most popular approaches to the test and measured their speed.

Runtime results (in seconds): [results chart omitted]

Based on these results, I do my own row additions via a list of dictionaries.

Code:

import pandas as pd
import numpy as np
import time

# clean up any leftovers from a previous run (no-op on the first run)
for name in ('df1', 'df2', 'df3', 'df4'):
    globals().pop(name, None)

numOfRows = 1000

# append (note: DataFrame.append was removed in pandas 2.0,
# so this variant requires pandas < 2)
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows - 4):
    df1 = df1.append(dict((a, np.random.randint(100)) for a in ['A', 'B', 'C', 'D', 'E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows):
    df2.loc[i] = np.random.randint(100, size=(1, 5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'])
startTime = time.perf_counter()
for i in range(1, numOfRows):
    df3.loc[i] = np.random.randint(100, size=(1, 5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# dict
startTime = time.perf_counter()
row_list = []
for i in range(0, 5):
    row_list.append(dict((a, np.random.randint(100)) for a in ['A', 'B', 'C', 'D', 'E']))
for i in range(1, numOfRows - 4):
    dict1 = dict((a, np.random.randint(100)) for a in ['A', 'B', 'C', 'D', 'E'])
    row_list.append(dict1)

df4 = pd.DataFrame(row_list, columns=['A', 'B', 'C', 'D', 'E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)

P.S. I realize my benchmark isn’t perfect, and there may be some room for improvement.

Answered by Mikhail_Sam

Solution #4

You can use pandas.concat() to do this. (DataFrame.append() also worked, but it was deprecated in pandas 1.4 and removed in pandas 2.0.) See Merge, join, and concatenate in the pandas documentation for more information and examples.
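A minimal sketch of appending a single row with `pd.concat` (the column names and values here are illustrative, matching the question's schema):

```python
import pandas as pd

df = pd.DataFrame({'lib': ['name0'], 'qty1': [10.0], 'qty2': [1]})

# Append one row by concatenating a one-row DataFrame built from a dict.
new_row = pd.DataFrame([{'lib': 'name1', 'qty1': 20.0, 'qty2': 2}])
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```

`ignore_index=True` renumbers the result 0..n-1, which is usually what you want when the row labels carry no meaning.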

Answered by NPE

Solution #5

Yes, people have already explained that you should NEVER grow a DataFrame, and that you should instead append your data to a list and convert it to a DataFrame once at the end. But do you understand why?

The most important reasons, as outlined in my post: appending to a list is cheap (amortized O(1)), while growing a DataFrame forces a reallocation and a full copy of the data on each iteration, making the loop quadratic in the number of rows. So collect the rows in a list and build the DataFrame once:

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

The quickest way to see how much these methods differ, in both speed and memory, is to benchmark them.
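As a minimal illustrative timing sketch (not the original answer's benchmark; row count and column names are arbitrary), comparing row-by-row growth via `.loc` against the list-then-DataFrame pattern:

```python
import time
import pandas as pd

n = 2000

# Approach 1: grow the DataFrame one row at a time with .loc.
start = time.perf_counter()
df_grow = pd.DataFrame(columns=['A', 'B', 'C'])
for i in range(n):
    df_grow.loc[i] = [i, i * 2, i * 3]
t_grow = time.perf_counter() - start

# Approach 2: accumulate rows in a list, build the DataFrame once.
start = time.perf_counter()
rows = [[i, i * 2, i * 3] for i in range(n)]
df_list = pd.DataFrame(rows, columns=['A', 'B', 'C'])
t_list = time.perf_counter() - start

print(f'.loc growth: {t_grow:.3f}s, list-then-DataFrame: {t_list:.3f}s')
```

Both approaches produce the same values; the gap between the two timings widens rapidly as `n` grows.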

Posts like this remind me why I joined this community in the first place: people recognize the importance of teaching others how to get the right answer with the right code, rather than the right answer with the wrong code. You might argue that using loc or append isn’t a problem if you’re only adding a single row to your DataFrame. However, people frequently come to this question wanting to add more than one row – often the requirement is to iteratively add rows inside a loop using data from a function (see related question). In that situation, it is critical to understand that iteratively growing a DataFrame is a bad idea.

Answered by cs95

Post is based on https://stackoverflow.com/questions/10715965/create-a-pandas-dataframe-by-appending-one-row-at-a-time