Order of rows shown changes on selection of columns from dependent pyspark dataframe

Question

Why does the order of rows displayed by show differ when I select only a subset of the dataframe's columns?

Here is the original dataframe:

[image: show output of the original dataframe]

Here dates are in the given order, as you can see, via show.

However, the order of rows displayed via show changes when I select a subset of the columns of predict_df into a new dataframe:

[image: show output of the new dataframe]

Ihor Konovalenko · Accepted Answer

Because a Spark dataframe is itself unordered. This follows from the parallel processing model that Spark uses. Different records may be located in different files (and on different nodes), and different executors may read the data at different times and in a different sequence.
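As a toy illustration of this point, the sketch below models a dataset as three partitions and concatenates them in two different completion orders. It is plain Python, not Spark's actual scheduler, and the names partitions and collect_in_order are made up for the example.

```python
# Toy model only: this is plain Python, not how Spark schedules tasks.
# Each inner list stands for one partition (file or block) of the same dataset.
partitions = [
    [("2021-01-01", 10), ("2021-01-02", 11)],
    [("2021-01-03", 12), ("2021-01-04", 13)],
    [("2021-01-05", 14)],
]


def collect_in_order(parts, completion_order):
    """Concatenate partitions in the order their (hypothetical) tasks finished."""
    rows = []
    for index in completion_order:
        rows.extend(parts[index])
    return rows


run_a = collect_in_order(partitions, [0, 1, 2])
run_b = collect_in_order(partitions, [2, 0, 1])

print(run_a == run_b)                  # False: same rows, different order
print(sorted(run_a) == sorted(run_b))  # True: an explicit sort removes the difference
```

Both runs contain exactly the same rows; only an explicit sort makes the two results identical.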

So you have to explicitly specify the order before the Spark action, using the orderBy (or sort) method. E.g.:

df.orderBy('date').show()

In this case the result is ordered by the date column and is more predictable. However, if many records share the same date value, the records within each such group are still unordered relative to each other. To obtain a strictly ordered result, we have to call orderBy on a set of columns whose combined values are unique for every row. E.g.:

from pyspark.sql.functions import col
df.orderBy(col("date").asc(), col("other_column").desc())

In general, unordered datasets are the normal case for data processing systems. Even "traditional" DBMSs such as PostgreSQL or MS SQL Server generally return records in no guaranteed order, and we have to use an explicit ORDER BY clause in the SELECT statement. Even if the same query sometimes returns the same results, the DBMS does not guarantee that another execution will return them in the same order, especially when a large amount of data is read.
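The same rule can be tried out in a small relational database. The sketch below uses Python's built-in sqlite3 module with an in-memory table. The table name predictions and its rows are invented for the example. The ORDER BY lists date plus the unique id column, so rows that share a date still come back in a fixed order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions (id INTEGER PRIMARY KEY, date TEXT, prediction REAL)"
)
conn.executemany(
    "INSERT INTO predictions (id, date, prediction) VALUES (?, ?, ?)",
    [
        (3, "2021-01-02", 0.4),
        (1, "2021-01-01", 0.9),
        (2, "2021-01-02", 0.7),
        (4, "2021-01-01", 0.1),
    ],
)

# Without ORDER BY the engine may return rows in whatever order it finds convenient.
# Ordering by (date, id) fixes the order, because id is unique.
ordered_rows = conn.execute(
    "SELECT date, id, prediction FROM predictions ORDER BY date ASC, id ASC"
).fetchall()
conn.close()

for row in ordered_rows:
    print(row)
```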

Tags:

apache-spark

apache-spark-sql

pyspark

sunakshi132
