I have a dictionary containing pandas data frames that have the same column names, and I'd like to remove duplicate data frames with identical values and row ids.
Let's assume this is my dictionary of data frames:
>>> dd[0]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[1]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[2]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[3]
Origin Destination Time
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
3 Boston New York 2016-03-28 08:00:00
>>> dd[4]
Origin Destination Time
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
3 Boston New York 2016-03-28 08:00:00
>>> dd[5]
Origin Destination Time
3 Boston New York 2016-03-28 08:00:00
4 New York Los Angeles 2016-03-28 12:00:00
>>> dd[6]
Origin Destination Time
3 Boston New York 2016-03-28 08:00:00
4 New York Los Angeles 2016-03-28 12:00:00
I want the result to look like this:
>>> dd[0]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[3]
Origin Destination Time
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
3 Boston New York 2016-03-28 08:00:00
>>> dd[5]
Origin Destination Time
3 Boston New York 2016-03-28 08:00:00
4 New York Los Angeles 2016-03-28 12:00:00
This is my code leading up to this above-mentioned example:
# Load data as pandas data frame
data = pd.read_csv("website.txt", names = ["Time", "Origin", `"Destination"])`
data["Time"] = pd.to_datetime(data["Time"], infer_datetime_format=True)
# Reverse data frame by index to loop backwards
data = data.reindex(index=df.index[::-1])
dd = {}
for i, e in reverse.iterrows():
dd[i] = data[ (data['Time'] > e['Time']-pd.Timedelta('4 hours')) & (data['Time'] < e['Time'] + pd.Timedelta('4 hours'))]
Original Text:
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{k: v.unstack() for k, v in pd.DataFrame({k: v.stack() for k, v in dd.iteritems()}).T.drop_duplicates().iterrows()}
# iterate through key, value pairs of dictionary,
# stacking each dataframe into a series so that we
# can pass the resulting dataframe into the pd.DataFrame constructor.
df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
# Each column is now one key, value pair from the original dictionary
# Transpose and drop duplicates
df2 = df1.T.drop_duplicates()
# reverse the original stacking and convert back to dictionary
# we could have used df2.T.iteritems() but df2.iterrows() took
# one fewer operations and fewer characters to type.
dd_ = {k: v.unstack() for k, v in df2.iterrows()}
for k, v in dd_.iteritems():
print 'key {}:'.format(k)
print v
print '-' * 10
key 0:
a b
0 1 2
1 3 4
----------
key 2:
a b
0 2 3
1 4 5
----------
from StringIO import StringIO
import pandas as pd
text0 = """ Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00"""
text1 = """ Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00"""
text2 = """ Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00"""
dd = {}
dd[0] = pd.read_csv(StringIO(text0), sep='\s{2,}', index_col=0, engine='python')
dd[0].Time = pd.to_datetime(dd[0].Time)
dd[1] = pd.read_csv(StringIO(text1), sep='\s{2,}', index_col=0, engine='python')
dd[1].Time = pd.to_datetime(dd[1].Time)
dd[2] = pd.read_csv(StringIO(text2), sep='\s{2,}', index_col=0, engine='python')
dd[2].Time = pd.to_datetime(dd[2].Time)
# Then run solutions above:
df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
df2 = df1.T.drop_duplicates()
dd_ = {k: v.unstack() for k, v in df2.iterrows()}
for k, v in dd_.iteritems():
print 'key {}:'.format(k)
print v
print '-' * 10
You should get this:
key 0:
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
----------
key 2:
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
----------
import sys
import pandas as pd
print sys.version
print pd.__version__
2.7.11 |Anaconda custom (x86_64)| (default, Dec 6 2015, 18:57:58)
[GCC 4.2.1 (Apple Inc. build 5577)]
0.18.1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With