From some brief testing, it appears that the column drop() function for PySpark DataFrames is not case sensitive, e.g.:
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("my-session").getOrCreate()
dff = sparkSession.createDataFrame([(10, 123), (14, 456), (16, 678)], ["age", "AGE"])
>>> dff.show()
+---+---+
|age|AGE|
+---+---+
| 10|123|
| 14|456|
| 16|678|
+---+---+
>>> dff.drop("AGE")
DataFrame[]
>>> dff_dropped = dff.drop("AGE")
>>> dff_dropped.show()
++
||
++
||
||
||
++
"""
What I'd like to see here is:
+---+
|age|
+---+
| 10|
| 14|
| 16|
+---+
"""
Is there a way to drop DataFrame columns in a case-sensitive way? (I have seen some comments related to this in Spark JIRA discussions, but I was looking for something that applies only to the drop() operation in an ad hoc way, not a global / persistent setting.)
# Add this before using drop
sparkSession.sql("set spark.sql.caseSensitive=true")
You need to set case sensitivity to true if you have two columns whose names differ only by case.
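For reference, a minimal end-to-end sketch, assuming the same sparkSession as in the question; sparkSession.conf.set is an equivalent way to apply the setting at runtime, and you can flip it back to false afterwards if you only want the behavior around a single drop():

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("my-session").getOrCreate()

# Enable case-sensitive column resolution for this session
sparkSession.conf.set("spark.sql.caseSensitive", "true")

dff = sparkSession.createDataFrame([(10, 123), (14, 456), (16, 678)], ["age", "AGE"])

# With case sensitivity enabled, only the exact-case match "AGE" is dropped
dff_dropped = dff.drop("AGE")
dff_dropped.show()
# +---+
# |age|
# +---+
# | 10|
# | 14|
# | 16|
# +---+

# Optionally restore the default afterwards
sparkSession.conf.set("spark.sql.caseSensitive", "false")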