Problem
This is a list of dictionaries I have:
[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
And I’d like to convert this into a pandas DataFrame that looks like this:
      month  points  points_h1  time  year
0       NaN      50        NaN  5:00  2010
1  february      25        NaN  6:00   NaN
2   january      90        NaN  9:00   NaN
3      june     NaN         20   NaN   NaN
It’s worth noting that the order of the columns is irrelevant.
How can I make the above-mentioned list of dictionaries into a pandas DataFrame?
Asked by appleLover
Solution #1
Assuming d is a list of dicts, type:
df = pd.DataFrame(d)
Note that nested data is not supported.
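As a quick check, the call above can be applied to the list from the question. This is a minimal sketch: the names d and df follow the answer, while the nested record at the end is a made-up illustration of the caveat, not data from the question.

```python
import pandas as pd

# The list of dictionaries from the question.
d = [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': "february"},
     {'points': 90, 'time': '9:00', 'month': 'january'},
     {'points_h1': 20, 'month': 'june'}]

df = pd.DataFrame(d)
print(df.shape)            # four records, five distinct keys
print(sorted(df.columns))

# Hypothetical nested record: the inner dict is not expanded into columns,
# it ends up as a single Python object in the 'info' cell.
nested = pd.DataFrame([{'name': 'x', 'info': {'a': 1}}])
print(type(nested.loc[0, 'info']))
```

Keys missing from a record simply become NaN in that row, which is the behaviour the question asks for.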
Answered by joris
Solution #2
The other answers are correct, but none of them discusses the advantages and limitations of these methods. The aim of this post is to show examples of these methods in different situations, discuss when to use them (and when not to), and suggest alternatives.
Depending on the structure and format of your data, all three methods (pd.DataFrame, pd.DataFrame.from_dict, and pd.DataFrame.from_records) may work, some may work better than others, or some may not work at all.
Consider a very contrived example.
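import numpy as np
import pandas as pd

np.random.seed(0)
# 'records' is spelled out because newer pandas versions reject the 'r' shorthand
data = pd.DataFrame(
    np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('records')
print(data)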
[{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
This list consists of “records” in which every key is present. This is the simplest case you could encounter.
# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
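The claim that the three calls agree can be checked directly. The sketch below hard-codes the records printed above instead of regenerating them, and the names df_ctor, df_dict, and df_recs are only illustrative.

```python
import pandas as pd

# The records printed above, written out literally.
data = [{'A': 5, 'B': 0, 'C': 3, 'D': 3},
        {'A': 7, 'B': 9, 'C': 3, 'D': 5},
        {'A': 2, 'B': 4, 'C': 7, 'D': 6}]

df_ctor = pd.DataFrame(data)
df_dict = pd.DataFrame.from_dict(data)
df_recs = pd.DataFrame.from_records(data)

# DataFrame.equals compares shape, labels, dtypes, and values.
same = df_ctor.equals(df_dict) and df_ctor.equals(df_recs)
print(same)
```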
Before continuing, it is important to distinguish between the different types of dictionary orientations and how pandas supports them. There are two primary types: “columns” and “index”.
orient='columns': Dictionaries with the “columns” orientation have keys that correspond to columns in the resulting DataFrame.
For example, data above is in the “columns” orientation.
data_c = [
{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
pd.DataFrame.from_dict(data_c, orient='columns')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Note that if you use pd.DataFrame.from_records, the orientation is assumed to be “columns” (you cannot specify otherwise), and the dictionaries are loaded accordingly.
orient='index': With this orient, keys are assumed to correspond to index values. This kind of data is best suited to pd.DataFrame.from_dict.
data_i ={
0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}
pd.DataFrame.from_dict(data_i, orient='index')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
This case isn’t mentioned in the OP, but it’s still worth knowing about.
You can use the index=… option to create a custom index on the generated DataFrame.
pd.DataFrame(data, index=['a', 'b', 'c'])
# pd.DataFrame.from_records(data, index=['a', 'b', 'c'])
A B C D
a 5 0 3 3
b 7 9 3 5
c 2 4 7 6
pd.DataFrame.from_dict does not support this.
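If from_dict is what you have at hand, one possible workaround is to build the frame first and assign the index afterwards. A minimal sketch, assuming the same three records as above; the labels list and the accepted flag are illustrative names only.

```python
import pandas as pd

data = [{'A': 5, 'B': 0, 'C': 3, 'D': 3},
        {'A': 7, 'B': 9, 'C': 3, 'D': 5},
        {'A': 2, 'B': 4, 'C': 7, 'D': 6}]
labels = ['a', 'b', 'c']  # illustrative index labels

try:
    pd.DataFrame.from_dict(data, index=labels)
    accepted = True
except TypeError:
    # from_dict has no index parameter, so the keyword is rejected.
    accepted = False

# Workaround: create the DataFrame, then attach the labels.
df = pd.DataFrame.from_dict(data)
df.index = labels

print(accepted)
print(df.index.tolist())
```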
When handling dictionaries with missing keys/column values, all three methods work out of the box. For example,
data2 = [
{'A': 5, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'F': 5},
{'B': 4, 'C': 7, 'E': 6}]
# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)
A B C D E F
0 5.0 NaN 3.0 3.0 NaN NaN
1 7.0 9.0 NaN NaN NaN 5.0
2 NaN 4.0 7.0 NaN 6.0 NaN
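The decimals in the output appear because NaN is a float, so any column with a missing entry is stored as float64. The sketch below inspects this and, as an optional step that is not part of the original answer, uses convert_dtypes() to switch to nullable integer columns.

```python
import pandas as pd

data2 = [{'A': 5, 'C': 3, 'D': 3},
         {'A': 7, 'B': 9, 'F': 5},
         {'B': 4, 'C': 7, 'E': 6}]

df = pd.DataFrame(data2)
print(df.dtypes)        # every column holds at least one NaN, so all are float64
print(df.isna().sum())  # number of missing entries per column

# Optional: nullable dtypes keep whole numbers as integers and show
# missing entries as <NA> instead of NaN.
df_nullable = df.convert_dtypes()
print(df_nullable.dtypes)
```

Depending on the pandas version, the column order may follow first appearance of each key rather than alphabetical order; as the question notes, the order is not important here.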
“What if I don’t want to read every single column?” you might wonder. The columns=… argument makes it simple to define this.
For example, in the data2 sample dictionary above, if you just want to read columns “A,” “D,” and “F,” you can pass a list:
pd.DataFrame(data2, columns=['A', 'D', 'F'])
# pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])
A D F
0 5.0 3.0 NaN
1 7.0 NaN 5.0
2 NaN NaN NaN
With the default orient “columns”, pd.DataFrame.from_dict does not support this.
pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])
ValueError: cannot use columns parameter with orient='columns'
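One way around this limitation, sketched here as a suggestion and not taken from the original answer, is to load all columns with from_dict and then keep only the ones you want with reindex. The names wanted, raised, and subset are illustrative.

```python
import pandas as pd

data2 = [{'A': 5, 'C': 3, 'D': 3},
         {'A': 7, 'B': 9, 'F': 5},
         {'B': 4, 'C': 7, 'E': 6}]
wanted = ['A', 'D', 'F']

try:
    pd.DataFrame.from_dict(data2, orient='columns', columns=wanted)
    raised = False
except ValueError:
    # columns= is rejected when orient='columns'
    raised = True

# Workaround: load everything, then keep only the wanted columns.
subset = pd.DataFrame.from_dict(data2).reindex(columns=wanted)

print(raised)
print(subset.columns.tolist())
```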
None of these methods directly supports reading a subset of rows. You will have to iterate over your data and perform a reverse delete in place as you go. For example, to extract only the 0th and 2nd rows from data2 above, you can use:
rows_to_select = {0, 2}
# Iterate in reverse so that deleting an item does not shift the
# positions of the items still to be checked.
for i in reversed(range(len(data2))):
    if i not in rows_to_select:
        del data2[i]
pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
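The reverse delete modifies data2 in place. If the original list should stay intact, two non-mutating alternatives can be sketched as follows; these are suggestions under the same sample data, and the names selected, df_filtered, and df_sliced are illustrative.

```python
import pandas as pd

data2 = [{'A': 5, 'C': 3, 'D': 3},
         {'A': 7, 'B': 9, 'F': 5},
         {'B': 4, 'C': 7, 'E': 6}]
rows_to_select = {0, 2}

# Option 1: filter the records into a new list; data2 is left untouched.
selected = [rec for i, rec in enumerate(data2) if i in rows_to_select]
df_filtered = pd.DataFrame(selected)

# Option 2: load everything, then pick rows by position.
df_sliced = pd.DataFrame(data2).iloc[sorted(rows_to_select)]

print(len(data2))
print(df_filtered.shape)
print(df_sliced.index.tolist())
```

The two options are not identical: filtering first drops columns that only occur in the skipped records and renumbers the rows, while slicing afterwards keeps every column and the original row labels.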
The json_normalize function, which works with lists of dictionaries (records) and can also handle nested dictionaries, is a strong, robust alternative to the methods outlined above.
pd.json_normalize(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
pd.json_normalize(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
Keep in mind that the data passed to json_normalize must be in list-of-dictionaries (records) format.
As mentioned, json_normalize can also handle nested dictionaries. Here is an example taken from the documentation.
data_nested = [
    {'counties': [{'name': 'Dade', 'population': 12345},
                  {'name': 'Broward', 'population': 40000},
                  {'name': 'Palm Beach', 'population': 60000}],
     'info': {'governor': 'Rick Scott'},
     'shortname': 'FL',
     'state': 'Florida'},
    {'counties': [{'name': 'Summit', 'population': 1234},
                  {'name': 'Cuyahoga', 'population': 1337}],
     'info': {'governor': 'John Kasich'},
     'shortname': 'OH',
     'state': 'Ohio'}
]
pd.json_normalize(data_nested,
                  record_path='counties',
                  meta=['state', 'shortname', ['info', 'governor']])
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich
For more information on the meta and record_path arguments, check out the documentation.
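To see what the flattening does in the simplest case, here is a small sketch that compares pd.DataFrame with pd.json_normalize on records containing one nested dictionary and no record_path. The records are a trimmed-down version of the sample data above, and the variable names are illustrative.

```python
import pandas as pd

records = [
    {'state': 'Florida', 'info': {'governor': 'Rick Scott'}},
    {'state': 'Ohio', 'info': {'governor': 'John Kasich'}},
]

plain = pd.DataFrame(records)        # 'info' stays a column of dicts
flat = pd.json_normalize(records)    # nested keys become dotted column names
flat_us = pd.json_normalize(records, sep='_')  # same, with a custom separator

print(plain.columns.tolist())
print(flat.columns.tolist())
print(flat_us.columns.tolist())
```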
Here’s a list of all the techniques mentioned above, along with the features and capabilities they support.
* To get the same effect as orient='index', use orient='columns' and then transpose.
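The transpose trick from the footnote can be sketched as follows, assuming the data_i dictionary from the orient='index' example. The names by_index and transposed are illustrative.

```python
import pandas as pd

data_i = {
    0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
    1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
    2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}

by_index = pd.DataFrame.from_dict(data_i, orient='index')

# With the default orientation the outer keys become columns;
# .T swaps rows and columns so they end up as the index instead.
transposed = pd.DataFrame.from_dict(data_i, orient='columns').T

# Both frames round-trip back to the original nested dictionary.
print(by_index.to_dict('index') == data_i)
print(transposed.to_dict('index') == data_i)
```

Transposing is harmless here because every value is an integer; with mixed column types, the transposed frame may end up with object dtype columns.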
Answered by cs95
Solution #3
To get this to work in pandas 0.16.2, I had to use pd.DataFrame.from_records(d).
Answered by szeitlin
Solution #4
Alternatively, you can use pd.DataFrame.from_dict(d):
In [8]: d = [{'points': 50, 'time': '5:00', 'year': 2010},
...: {'points': 25, 'time': '6:00', 'month': "february"},
...: {'points':90, 'time': '9:00', 'month': 'january'},
...: {'points_h1':20, 'month': 'june'}]
In [12]: pd.DataFrame.from_dict(d)
Out[12]:
      month  points  points_h1  time    year
0       NaN    50.0        NaN  5:00  2010.0
1  february    25.0        NaN  6:00     NaN
2   january    90.0        NaN  9:00     NaN
3      june     NaN       20.0   NaN     NaN
Answered by shivsn
Solution #5
import csv

# Raw string so the backslashes in the Windows path are not treated as escapes
myfile = r'C:\Users\John\Desktop\export_dataframe.csv'
records_to_save = data2  # data2 as used earlier in the thread
# colnames collects every key across all records, in first-seen order
colnames = list(dict.fromkeys(k for rec in records_to_save for k in rec))
# Values are written under their keys, and "None" is written
# wherever a record is missing a key
with open(myfile, 'w', newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(colnames)
    for d in records_to_save:
        writer.writerow([d.get(r, "None") for r in colnames])
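The same export can also be sketched with csv.DictWriter from the standard library, which fills in missing keys through its restval argument. This is an alternative and not the answer's own code; it writes to an in-memory buffer instead of a file so that it runs anywhere, and the sample records are the three-row data2 from earlier in the thread.

```python
import csv
import io

data2 = [{'A': 5, 'C': 3, 'D': 3},
         {'A': 7, 'B': 9, 'F': 5},
         {'B': 4, 'C': 7, 'E': 6}]

# Union of all keys, in the order they first appear.
fieldnames = list(dict.fromkeys(key for rec in data2 for key in rec))

buffer = io.StringIO()  # in-memory stand-in for a file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="None")
writer.writeheader()
writer.writerows(data2)  # missing keys are filled with restval

print(buffer.getvalue())
```

If a DataFrame has already been built, calling its to_csv method is usually the shorter route.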
Answered by Soum
Post is based on https://stackoverflow.com/questions/20638006/convert-list-of-dictionaries-to-a-pandas-dataframe