There are two common compressed file formats for Spark. One is Parquet, which is very easy to read:
from pyspark.sql import HiveContext

# Spark 1.x API; parquetFile() is deprecated in later versions
hiveCtx = HiveContext(sc)
parquet_df = hiveCtx.parquetFile(parquetFile)
But for ORC files, I could not find a good example showing how to read them with PySpark.
Well, there are two ways, depending on your Spark version:
Spark 2.x:
orc_df = spark.read.orc('python/test_support/sql/orc_partitioned')
Spark 1.6:
df = hiveContext.read.orc('python/test_support/sql/orc_partitioned')