Problem
I’m making a DataFrame from a csv in the following way:
stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)
There is a date column in the DataFrame. Is it possible to construct a new DataFrame (or simply overwrite an existing one) that only contains rows with date values that fall within a given date range or between two dates?
Asked by darkpool
Solution #1
There are two options available:
Using a boolean mask entails the following steps:
Ensure that df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetimes, np.datetime64s, pd.Timestamps, or even datetime strings:
# greater than the start date and smaller than or equal to the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
Alternatively, re-assign to df
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
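The point above about bound types can be checked with a short sketch. It builds the same mask four times, once per bound type, and confirms that every variant selects the same rows. The `bounds` list and the `results` and `same` names are illustrative only and are not part of the original answer.

```python
import datetime as dt

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2000-1-1', periods=200, freq='D')})

# Four ways to express the same pair of bounds
bounds = [
    ('2000-6-1', '2000-6-10'),                                   # datetime strings
    (dt.datetime(2000, 6, 1), dt.datetime(2000, 6, 10)),         # datetime.datetime
    (np.datetime64('2000-06-01'), np.datetime64('2000-06-10')),  # np.datetime64
    (pd.Timestamp('2000-06-01'), pd.Timestamp('2000-06-10')),    # pd.Timestamp
]

results = []
for start_date, end_date in bounds:
    # exclusive start, inclusive end, as in the mask above
    mask = (df['date'] > start_date) & (df['date'] <= end_date)
    results.append(df.loc[mask])

# True if every variant selected exactly the same rows
same = all(r.equals(results[0]) for r in results)
print(same)
```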
Using a DatetimeIndex:
If you're going to make a lot of date selections, it may be quicker to make the date column the index first. Then use df.loc[start_date:end_date] to select rows by date.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
Unlike Python list indexing, which includes start but not end (e.g. seq[start:end]), Pandas df.loc[start_date:end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
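As a small illustration of slicing with end-points that are missing from the index, the sketch below uses a weekly, sorted DatetimeIndex, so most calendar days are absent. The `value` column and the weekly frequency are assumptions made for the example.

```python
import pandas as pd

# Weekly dates only: 2000-01-01, 2000-01-08, ... (sorted index)
idx = pd.date_range('2000-1-1', periods=30, freq='7D')
df = pd.DataFrame({'value': range(30)}, index=idx)

# Neither '2000-2-1' nor '2000-3-1' appears in the index,
# yet the slice still returns the rows that fall between them.
subset = df.loc['2000-2-1':'2000-3-1']
print(subset)
```

This relies on the index being sorted; the example does not cover unsorted indexes.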
Also, pd.read_csv has a parse_dates parameter which you can use to parse the date column as datetime64s. If you use parse_dates, you won't need df['date'] = pd.to_datetime(df['date']).
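A minimal sketch of the parse_dates route, assuming the CSV has a column literally named `date`. The in-memory CSV text and the `close` column are made up so the example runs without a file on disk.

```python
import io

import pandas as pd

# Stand-in for the CSV file from the question (contents are invented)
csv_text = "date, close\n2000-06-01, 10.0\n2000-06-02, 10.5\n2000-06-03, 11.0\n"

stock = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,
    parse_dates=['date'],  # parse this column as datetime64 while reading
)
print(stock.dtypes)

# The column can be compared against date strings right away
filtered = stock[(stock['date'] > '2000-06-01') & (stock['date'] <= '2000-06-03')]
print(filtered)
```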
Answered by unutbu
Solution #2
I believe that utilizing direct checks rather than the loc function is the best option:
df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
It seems to work for me.
For loc with a slice to work, the bounds may have to be present among the actual index values (for example, when the index is not sorted); otherwise a KeyError can occur.
Answered by Christin Jose
Solution #3
You can also use between:
df[df.some_date.between(start_date, end_date)]
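Here is a runnable sketch of the between approach, using a generated `some_date` column in place of real data. By default, Series.between includes both bounds, so the start date itself is selected, unlike the `>` mask in Solution #1.

```python
import pandas as pd

df = pd.DataFrame({'some_date': pd.date_range('2000-1-1', periods=200, freq='D')})

start_date, end_date = '2000-6-1', '2000-6-10'

# between() is inclusive on both ends by default
selected = df[df.some_date.between(start_date, end_date)]
print(selected)
```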
Answered by pomber
Solution #4
You can use the isin method on the date column like so: df[df["date"].isin(pd.date_range(start_date, end_date))]
Note: this only works with dates (as the question asks), not with timestamps.
Example:
import numpy as np
import pandas as pd
# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]
print(in_range_df) # print result
which gives
0 1 2 date
14 0.960974 0.144271 0.839593 2017-01-15
15 0.814376 0.723757 0.047840 2017-01-16
16 0.911854 0.123130 0.120995 2017-01-17
17 0.505804 0.416935 0.928514 2017-01-18
18 0.204869 0.708258 0.170792 2017-01-19
19 0.014389 0.214510 0.045201 2017-01-20
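To illustrate the caveat about timestamps, the sketch below uses rows with a 09:30 time component, which is an invented example. pd.date_range generates midnight values, so isin finds no exact matches, while a comparison mask or normalizing the column to midnight first still selects the rows.

```python
import pandas as pd

# Timestamps with a time-of-day component (09:30), not plain dates
df = pd.DataFrame({'date': pd.date_range('2017-01-15 09:30', periods=3, freq='D')})

wanted = pd.date_range('2017-01-15', '2017-01-20')  # midnight values only

# Exact matching: 09:30 timestamps never equal the midnight values
by_isin = df[df['date'].isin(wanted)]

# A comparison mask does not need exact matches
by_mask = df[(df['date'] >= '2017-01-15') & (df['date'] <= '2017-01-20')]

# Alternatively, drop the time component before calling isin
by_normalized = df[df['date'].dt.normalize().isin(wanted)]

print(len(by_isin), len(by_mask), len(by_normalized))
```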
Answered by Jonny Brooks
Solution #5
I recommend trying this approach because it is simple and pythonic.
If you are going to do this frequently, the best solution is to first set the date column as the index, which converts it into a DatetimeIndex, and then use the following condition to slice any range of dates.
import pandas as pd
data_frame = data_frame.set_index('date')
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
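The snippet above assumes data_frame already exists. A self-contained sketch, with a generated `date` column and a made-up `value` column standing in for real data, might look like this:

```python
import pandas as pd

# Hypothetical data: one row per day in August 2017
data_frame = pd.DataFrame({
    'date': pd.date_range('2017-08-01', periods=31, freq='D'),
    'value': range(31),
})

# Setting a datetime64 column as the index yields a DatetimeIndex
data_frame = data_frame.set_index('date')

# Exclusive start, inclusive end, as in the condition above
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
print(df)
```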
Answered by Abhinav Anand
Post is based on https://stackoverflow.com/questions/29370057/select-dataframe-rows-between-two-dates