Since Spark has no out-of-the-box support for reading Excel files, I first read the Excel file into a pandas DataFrame and then try to convert it into a Spark DataFrame, but I get the error below (I am using Spark 1.5.1):
import pandas as pd
from pandas import ExcelFile
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
pdf=pd.read_excel('/home/testdata/test.xlsx')
df = sqlContext.createDataFrame(pdf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame
rdd, schema = self._createFromLocal(data, schema)
File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 337, in _createFromLocal
data = [schema.toInternal(row) for row in data]
File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in toInternal
return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in <genexpr>
return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 435, in toInternal
return self.dataType.toInternal(obj)
File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 191, in toInternal
else time.mktime(dt.timetuple()))
AttributeError: 'datetime.time' object has no attribute 'timetuple'
Does anybody know how to fix it?
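My best guess is that your problem comes from pandas "incorrectly" parsing the datetime data when it reads the file. The traceback shows that a column Spark treats as a timestamp contains datetime.time objects, which have no timetuple() method.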
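The last line of the traceback can be reproduced with the standard library alone. This is only a small illustration of the difference between the two types, not a reproduction of Spark's internals; the example values are made up.

```python
import datetime

# A time-of-day value and a full timestamp (both values are made up)
t = datetime.time(12, 30)
dt = datetime.datetime(2015, 11, 1, 12, 30)

# datetime.datetime provides timetuple(), datetime.time does not,
# which is why calling dt.timetuple() on a time object raises AttributeError
has_time = hasattr(t, "timetuple")
has_dt = hasattr(dt, "timetuple")
print(has_time, has_dt)
```

So any column that reaches Spark as datetime.time values would need to be converted to something else first.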
The following code "just works":
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Parse the date/time columns explicitly while reading the Excel file
pdf = pd.read_excel('test.xlsx', parse_dates=['Created on', 'Confirmation time'])

# Outside the pyspark shell, the contexts have to be created by hand
sc = SparkContext()
sqlContext = SQLContext(sc)

# Convert the pandas DataFrame to a Spark DataFrame and pull the rows back
sqlContext.createDataFrame(data=pdf).collect()
[Row(Customer=1000935702, Country='TW', ...
Please note that you have one more datetime column, 'Confirmation date', which in your example consists only of NaT and therefore converts without a problem with your short sample. If it holds actual data in the full dataset, you will have to take care of that column as well.
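If you are not sure which columns are affected, one option is to scan the pandas DataFrame for datetime.time values before handing it to Spark. The sketch below is an assumption-laden illustration rather than part of the fix above: the helper names find_time_columns and stringify_time_columns are hypothetical, the sample rows are made up, and turning the values into strings is just one possible conversion.

```python
import datetime

import pandas as pd


def find_time_columns(pdf):
    """Return the names of object columns holding at least one datetime.time."""
    cols = []
    for name in pdf.columns:
        if pdf[name].dtype == object:
            if pdf[name].map(lambda v: isinstance(v, datetime.time)).any():
                cols.append(name)
    return cols


def stringify_time_columns(pdf):
    """Return a copy in which datetime.time values are replaced by ISO strings."""
    out = pdf.copy()
    for name in find_time_columns(out):
        out[name] = out[name].map(
            lambda v: v.isoformat() if isinstance(v, datetime.time) else v
        )
    return out


# Made-up sample data; only the column names come from the question
sample = pd.DataFrame({
    "Customer": [1000935702, 1000935703],
    "Confirmation time": [datetime.time(9, 15), datetime.time(17, 40, 5)],
})

time_cols = find_time_columns(sample)
cleaned = stringify_time_columns(sample)
print(time_cols)
print(cleaned["Confirmation time"].tolist())
```

The cleaned frame contains plain strings in those columns, so the original DataFrame stays untouched and you can decide later whether to parse the strings back into timestamps.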