After creating a Spark DataFrame from a CSV file, I would like to trim a column. I've tried:
df = df.withColumn("Product", df.Product.strip())
Here df is my DataFrame and Product is a column in my table.
But I get the error:
Column object is not callable
Any suggestions?
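The error occurs because a Column has no strip() method: df.Product.strip is treated as a nested field lookup that returns another Column, and a Column cannot be called. Use the trim function from pyspark.sql.functions instead. To trim every string column, you can use the dtypes attribute of the DataFrame API to get the list of column names along with their data types, and then apply trim to each column whose type is string.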
length: Computes the character length of string data or the number of bytes of binary data. The length of character data includes trailing spaces.
from pyspark.sql.functions import col, trim
df = df.withColumn("Product", trim(col("Product")))
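The dtypes approach mentioned above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names string_columns and trim_string_columns are made up for this example, and the column names are assumed to be plain (no dots or other special characters).

```python
def string_columns(dtypes):
    """Return the names of columns whose Spark type is 'string'.

    `dtypes` is the list of (name, type) pairs that DataFrame.dtypes returns,
    for example [("Product", "string"), ("Price", "double")].
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def trim_string_columns(df):
    """Trim leading and trailing whitespace from every string column of df."""
    # Imported here so the helper above stays usable without a Spark install.
    from pyspark.sql.functions import col, trim

    # Hypothetical helper: assumes plain column names without dots.
    for name in string_columns(df.dtypes):
        df = df.withColumn(name, trim(col(name)))
    return df
```

With a DataFrame loaded from the CSV file, this would be called as df = trim_string_columns(df). Non-string columns are left untouched, since trim is only applied to the names that string_columns selects.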
Starting from version 1.5, Spark SQL provides functions for trimming white space: trim removes it from both ends of a string, while ltrim and rtrim remove it from the left or right end only (search for "trim" in the DataFrame documentation). You'll need to import them from pyspark.sql.functions first. Here is an example using ltrim and rtrim:
from pyspark.sql import SQLContext
from pyspark.sql.functions import ltrim, rtrim
sqlContext = SQLContext(sc)
# create a dataframe - notice the extra whitespaces in the date strings
df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2'])
df.collect()
# [Row(d1=u' 2015-04-08 ', d2=u' 2015-05-10 ')]
df = df.withColumn('d1', ltrim(df.d1)) # trim left whitespace from column d1
df.collect()
# [Row(d1=u'2015-04-08 ', d2=u' 2015-05-10 ')]
df = df.withColumn('d1', rtrim(df.d1)) # trim right whitespace from d1
df.collect()
# [Row(d1=u'2015-04-08', d2=u' 2015-05-10 ')]