PySpark: how to read in partitioning columns when reading parquet

Question

I have data stored in a parquet files and hive table partitioned by year, month, day. Thus, each parquet file is stored in /table_name/year/month/day/ folder.

I want to read in data for only some of the partitions. I have list of paths to individual partitions as follows:

paths_to_files = ['hdfs://data/table_name/2018/10/29',
                  'hdfs://data/table_name/2018/10/30']

And then try to do something like:

df = sqlContext.read.format("parquet").load(paths_to_files)

However, then my data does not include the information about year, month and day, as this is not part of the data per se, rather the information is stored in the path to the file.

I could use sql context and a send hive query with some select statement with where on the year, month and day columns to select only data from partitions i am interested in. However, i'd rather avoid constructing SQL query in python as I am very lazy and don't like reading SQL.

I have two questions:

what is the optimal way (performance-wise) to read in the data stored as parquet, where information about year, month, day is not present in the parquet file, but is only included in the path to the file? (either send hive query using sqlContext.sql('...'), or use read.parquet,... anything really.
Can i somehow extract the partitioning columns when using the approach i outlined above?

OneCricketeer · Accepted Answer

Reading the direct file paths to the parent directory of the year partitions should be enough for a dataframe to determine there's partitions under it. However, it wouldn't know what to name the partitions without the directory structure /year=2018/month=10, for example.

Therefore, if you have Hive, then going via the metastore would be better because the partitions are named there, Hive stores extra useful information about your table, and then you're not reliant on knowing the direct path to the files on disk from the Spark code.

Not sure why you think you need to read/write SQL, though.

Use the Dataframe API instead, e.g

df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()

PySpark: how to read in partitioning columns when reading parquet

Tags:

apache-spark

apache-spark-sql

hadoop

pyspark

parquet

ira

1 Answers

OneCricketeer

Recent Activity

Donate For Us

PySpark: how to read in partitioning columns when reading parquet

Tags:

apache-spark

apache-spark-sql

hadoop

pyspark

parquet

ira

1 Answers

OneCricketeer

Related questions

Recent Activity

Donate For Us