Problem
While learning pandas, I spent months trying to puzzle out an answer to this question. In my day-to-day work I use SAS, which is great for its out-of-core support. However, SAS is terrible as a piece of software for numerous other reasons.
I’d like to eventually replace SAS with Python and pandas, but right now I lack an out-of-core workflow for large datasets. I’m not talking about “big data” that requires a distributed network, but rather files that are too big to fit in memory yet small enough to fit on a hard drive.
My first thought was to use HDFStore to hold large datasets on disk and pull only the pieces I need into DataFrames for analysis. Others have suggested MongoDB as an easier-to-use alternative. My question is this:
What are some good workflows for accomplishing the following tasks:
1. Loading flat files into a permanent, on-disk database structure
2. Querying that database to retrieve data to feed into a pandas data structure
3. Updating the database after manipulating pieces in pandas
Real-world examples would be much appreciated, especially from anyone who uses pandas on “large data.”
Edit — here’s an example of how I’d like this to work in practice: iteratively import a large flat file, store it on disk in a permanent database structure, and then pull subsets of that data into pandas for analysis. I’m trying to figure out the most efficient way to perform these tasks. According to the resources on pandas and PyTables, appending a new column could be a problem.
Edit — in response to Jeff’s specific questions:
It’s rare for me to append rows to a dataset. I will nearly always be creating new columns (known as variables or features in statistics and machine learning).
Asked by Zelazny7
Solution #1
I routinely use tens of gigabytes of data in just this fashion. For example, I have tables on disk that I read via queries, create data from, and append back to.
It’s worth reading the docs and the last few posts in this thread for a few ideas on how to store your data.
A few details will affect how you store your data, such as how big it is and what your fields look like. Give as much detail as you can and I can help you develop a structure.
Make sure you have at least pandas 0.10.1 installed.
Read up on iterating over files chunk-by-chunk and on multiple-table queries.
Since PyTables is optimized to operate row-wise (which is what you query on), we’ll create a table for each group of fields. This way it’s simple to select a small group of fields (this will work with one big table as well, but it’s more efficient this way… I think I’ll be able to fix this limitation in the future… and it’s more intuitive anyway). (The code that follows is pseudocode, for illustrative purposes only.)
import numpy as np
import pandas as pd
# create a store
store = pd.HDFStore('mystore.h5')
# this is the key to your storage:
# this maps your fields to a specific group, and defines
# what you want to have as data_columns.
# you might want to create a nice class wrapping this
# (as you will want to have this map and its inversion)
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',...... ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),
)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))
Reading the files and creating storage (basically the same thing as append to multiple):
for f in files:
    # read in the file, additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory in which case just eliminate this part of the loop
    # (you can also change chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):
        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():
            # create the frame for this group
            frame = chunk.reindex(columns = v['fields'], copy = False)

            # append it
            store.append(g, frame, index=False, data_columns = v['dc'])
Now the file has all of the tables (you could keep them in separate files if you wanted, but then you’d have to add the filename to the group map, which is probably unnecessary).
This is how you get data and create new columns:
frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
# select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows
# do calculations on this frame
new_frame = cool_function_on_frame(frame)
# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones,
# e.g. so you don't overlap with the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)
When you are ready for post_processing:
# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([group_1, group_2, .....], where=['field_1>0', 'field_1000=foo'], selector=group_1)
You don’t have to define any data_columns; they simply allow you to sub-select rows based on that column. For instance:
store.select(group, where = ['field_1000=foo', 'field_1001>0'])
They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).
You might also want to: write a function that takes a list of fields, looks up the groups in the group map, selects them, and concatenates the results so you get the resulting frame; create indexes on certain data columns (this makes row-subsetting much faster); and enable compression.
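As a rough sketch of the first point, a helper along these lines could work (the function name and exact shape here are illustrative, not part of the original workflow; it assumes the group_map_inverted mapping built above):

def select_fields(store, fields, where=None):
    # look up which group each requested field lives in
    # (group_map_inverted maps field name -> group name)
    groups = sorted(set(group_map_inverted[f] for f in fields))
    if len(groups) == 1:
        return store.select(groups[0], where=where, columns=fields)
    # rows line up across groups because every group was appended
    # chunk-by-chunk from the same source files
    return store.select_as_multiple(groups, where=where,
                                    columns=fields, selector=groups[0])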
Please contact me if you have any questions.
Answered by Jeff
Solution #2
I believe the answers above are missing a straightforward technique that I’ve found to be really helpful.
When I have a file that is too large to fit in memory, I split it up into multiple smaller files (either by rows or by columns).
For example, with 30 days’ worth of trade data totaling about 30GB, I break it into one roughly 1GB file per day. I then process each file separately and combine the results at the end.
One of the biggest benefits is that the files can be processed in parallel (using either multiple threads or multiple processes).
Another benefit is that file manipulation (for example, adding or removing dates) can be done with ordinary shell commands, which is not possible with more advanced/complicated file formats.
This method does not cover all circumstances, but it is quite valuable in a large number of them.
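To illustrate the split-process-combine idea, here is a minimal sketch assuming hypothetical daily CSV files and column names (trades-2013-01-*.csv, symbol, quantity):

import glob
from multiprocessing import Pool

import pandas as pd

def process_one_day(path):
    # each daily file fits comfortably in memory on its own
    df = pd.read_csv(path)
    # hypothetical per-file computation: total traded quantity per symbol
    return df.groupby('symbol')['quantity'].sum()

if __name__ == '__main__':
    files = sorted(glob.glob('trades-2013-01-*.csv'))
    # process the daily files in parallel
    with Pool(processes=4) as pool:
        partials = pool.map(process_one_day, files)
    # combine the per-file results at the end
    result = pd.concat(partials).groupby(level=0).sum()
    print(result)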
Answered by user1827356
Solution #3
Two years after the question was asked, there is now an ‘out-of-core’ pandas equivalent: dask. It’s fantastic! Although it does not support all of pandas’ functionality, you can get very far with it. Update: it has been continuously maintained for the past two years, and there is a sizable community working with Dask.
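For a sense of what this looks like, here is a minimal Dask sketch (the file pattern and column names are hypothetical):

import dask.dataframe as dd

# build a lazy dataframe over many CSV files that together exceed RAM
df = dd.read_csv('trades-2013-*.csv')

# pandas-style syntax; nothing is computed yet
mean_price = df.groupby('symbol')['price'].mean()

# trigger the out-of-core computation
print(mean_price.compute())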
Four years after the question was asked, there is another high-performance ‘out-of-core’ pandas equivalent: Vaex. It “uses memory mapping, zero memory copy policy, and lazy computations for best performance (no memory wasted),” and it can work through billions of rows of data without holding them in memory (even making it possible to do analysis on suboptimal hardware).
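A comparable Vaex sketch might look like this (the HDF5 file and column names are hypothetical):

import vaex

# memory-map an HDF5 file; the data is not loaded into RAM up front
df = vaex.open('trades.hdf5')

# filters and expressions are lazy
positive = df[df.price > 0]

# aggregations stream over the memory-mapped data
print(positive.mean(positive.price))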
Answered by Private
Solution #4
If your datasets are between 1 and 20 GB, a workstation with 48 GB of RAM is recommended. Pandas can then store the complete dataset in memory. I realize this isn’t the answer you’re hoping for, but doing scientific computing on a laptop with only 4GB of RAM isn’t feasible.
Answered by rjurney
Solution #5
I realize this is an old thread, but I believe the Blaze library is worth a look. It’s built for situations like this.
From the docs:
Blaze expands NumPy and Pandas’ capabilities to distributed and out-of-core computation. Blaze provides a user interface that is similar to the NumPy ND-Array or Pandas DataFrame, but it links these familiar interfaces to a range of alternative computational engines, such as Postgres or Spark.
By the way, it’s backed by ContinuumIO and Travis Oliphant, the NumPy creator.
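As a rough idea of what a Blaze workflow could look like (the Postgres URI and table/column names below are hypothetical, and the exact API may differ between Blaze versions):

from blaze import data, by

# point Blaze at an out-of-core backend such as a Postgres table
trades = data('postgresql://user:pass@localhost/mydb::trades')

# pandas-like expressions are translated into queries against the backend
result = by(trades.symbol, avg_price=trades.price.mean())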
Answered by chishaku
Post is based on https://stackoverflow.com/questions/14262433/large-data-workflows-using-pandas