I have a CSV file with 1-minute stock data spanning multiple days. Each day runs from 9:30 to 16:00.
Some of the minutes in the time series are missing (here 2013-09-16 09:32:00 and 2013-09-17 09:31:00):
2013-09-16 09:30:00,461.01,461.49,461,461,183507
2013-09-16 09:31:00,460.82,461.6099,460.39,461.07,212774
2013-09-16 09:33:00,460.0799,460.88,458.97,459.2401,207880
2013-09-16 09:34:00,458.97,460.08,458.8,460.04,148121
...
2013-09-16 15:59:00,449.72,450.0774,449.59,449.95,146399
2013-09-16 16:00:00,450.12,450.12,449.65,449.65,444594
2013-09-17 09:30:00,448,448,447.5,447.96,173624
2013-09-17 09:32:00,450.6177,450.9,449.05,449.2701,268715
2013-09-17 09:33:00,451.39,451.96,450.58,450.7061,197019
...
...
With pandas, how do I forward fill the series so every minute is present? It should look like this:
2013-09-16 09:30:00,461.01,461.49,461,461,183507
2013-09-16 09:31:00,460.82,461.6099,460.39,461.07,212774
2013-09-16 09:32:00,460.82,461.6099,460.39,461.07,212774 <-- forward filled
2013-09-16 09:33:00,460.0799,460.88,458.97,459.2401,207880
2013-09-16 09:34:00,458.97,460.08,458.8,460.04,148121
...
2013-09-16 15:59:00,449.72,450.0774,449.59,449.95,146399
2013-09-16 16:00:00,450.12,450.12,449.65,449.65,444594
2013-09-17 09:30:00,448,448,447.5,447.96,173624
2013-09-17 09:31:00,448,448,447.5,447.96,173624 <-- forward filled
2013-09-17 09:32:00,450.6177,450.9,449.05,449.2701,268715
2013-09-17 09:33:00,451.39,451.96,450.58,450.7061,197019
...
It also needs to handle the case where multiple consecutive minutes are missing.
So I copied your first 4 lines into a dataframe:
Out[49]:
0 1 2 3 4 5
0 2013-09-16 09:30:00 461.0100 461.4900 461.00 461.0000 183507
1 2013-09-16 09:31:00 460.8200 461.6099 460.39 461.0700 212774
2 2013-09-16 09:33:00 460.0799 460.8800 458.97 459.2401 207880
3 2013-09-16 09:34:00 458.9700 460.0800 458.80 460.0400 148121
then
df1 = df.set_index(keys=[0]).resample('1min').ffill()
df1
Out[52]:
1 2 3 4 5
0
2013-09-16 09:30:00 461.0100 461.4900 461.00 461.0000 183507
2013-09-16 09:31:00 460.8200 461.6099 460.39 461.0700 212774
2013-09-16 09:32:00 460.8200 461.6099 460.39 461.0700 212774
2013-09-16 09:33:00 460.0799 460.8800 458.97 459.2401 207880
2013-09-16 09:34:00 458.9700 460.0800 458.80 460.0400 148121
This will also deal with multiple missing values and forward fill them.
So if I have data like
2013-09-17 09:30:00,448,448,447.5,447.96,173624
2013-09-17 09:33:00,451.39,451.96,450.58,450.7061,197019
and do the same thing as before this gives:
Out[55]:
1 2 3 4 5
0
2013-09-17 09:30:00 448.00 448.00 447.50 447.9600 173624
2013-09-17 09:31:00 448.00 448.00 447.50 447.9600 173624
2013-09-17 09:32:00 448.00 448.00 447.50 447.9600 173624
2013-09-17 09:33:00 451.39 451.96 450.58 450.7061 197019
The key thing here is that you must have a DatetimeIndex. If you also want to keep the timestamp as a column, just set drop=False in set_index.
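To make the DatetimeIndex point concrete, here is a minimal sketch that parses the question's first four rows from CSV text and resamples them. The column names (open, high, low, close, volume) are my assumption, since the question never names them, and io.StringIO only stands in for the real file.

```python
import io
import pandas as pd

# Sample rows from the question; 09:32 is missing.
csv_text = """\
2013-09-16 09:30:00,461.01,461.49,461,461,183507
2013-09-16 09:31:00,460.82,461.6099,460.39,461.07,212774
2013-09-16 09:33:00,460.0799,460.88,458.97,459.2401,207880
2013-09-16 09:34:00,458.97,460.08,458.8,460.04,148121
"""

# Column names are assumed for illustration; the file has no header row.
names = ["timestamp", "open", "high", "low", "close", "volume"]

# index_col + parse_dates=True turns the first column into a DatetimeIndex.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=names,
                 index_col="timestamp", parse_dates=True)

# Upsample to one row per minute, copying the previous row into any gap.
filled = df.resample("1min").ffill()
print(filled)
```

Note that resampling a file that spans several days this way also creates rows for the overnight minutes, so those still need to be filtered out afterwards.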
This might work slightly better for you as it takes into account different days such that you don't have to fill for each day:
Just creating the dataframe:
import pandas as pd

list1 = [["2013-09-16 09:29:00","461.01","461.49","461","461","183507"],
["2013-09-16 09:31:00", "460.82", "461.6099", "460.39", "461.07", "212774"],
["2013-09-16 09:34:00", "460.0799", "460.88", "458.97", "459.2401", "207880"],
["2013-09-17 09:35:00", "458.97", "460.08", "458.8", "460.04", "148121"]]
cols = ['date','price1','price2','price3', 'price4', 'price5']
df = pd.DataFrame(list1, columns=cols)
Setting the index to be the date column:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
Reindexing and filling the holes, then forward filling the resulting NaN values, then dropping all times outside of 9:30AM to 4:00PM:
df2 = df.reindex(pd.date_range(df.index[0], df.index[-1], freq='min')).ffill().between_time('09:30', '16:00')
Those statements can be broken out sequentially:
First, reindex the dataframe so that your index corresponds to your starting date/time through your ending date/time with a frequency of 1 minute:
df2 = df.reindex(pd.date_range(df.index[0], df.index[-1], freq='min'))
This will create a lot of NaN values where the new index didn't line up with the old one. We fill this with ffill (forward fill), though there are other options out there:
df2.ffill(inplace=True)
Then finally, strip out the times that are outside your 9:30AM to 4:00PM time range:
df_final = df2.between_time('09:30', '16:00')
between_time accepts time strings (or datetime.time objects such as datetime.time(9, 30)), so there is no need to build a full datetime object just to call .time() on it. Both endpoints are included by default, so the 9:30AM and 4:00PM rows are kept.
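The three steps above can be wrapped into one small helper. This is only a sketch: the function name fill_session_minutes and the single close column are made up for illustration, and it assumes the frame already has a sorted, duplicate-free DatetimeIndex.

```python
import pandas as pd

def fill_session_minutes(df, start="09:30", end="16:00"):
    """Forward-fill missing minutes, then keep only rows between start and end."""
    # One row per minute from the first timestamp to the last one.
    full_index = pd.date_range(df.index[0], df.index[-1], freq="min")
    # New rows start out as NaN; ffill copies the last known row into them.
    filled = df.reindex(full_index).ffill()
    # Drop the overnight minutes created by the continuous range.
    return filled.between_time(start, end)

# Hypothetical two-day sample with gaps at 09:32 (day 1) and 09:30-09:31 (day 2).
raw = pd.DataFrame(
    {"close": [461.0, 461.07, 459.2401, 447.96]},
    index=pd.to_datetime([
        "2013-09-16 09:30:00", "2013-09-16 09:31:00",
        "2013-09-16 09:33:00", "2013-09-17 09:32:00",
    ]),
)
result = fill_session_minutes(raw)
print(result.head())
```

One consequence of filling before filtering: if the first minutes of a day are missing, they are filled from the previous day's last row rather than left empty. Whether that is acceptable depends on what you do with the data.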