I would like to change the following dataframe:
--id--rating--timestamp--
-------------------------
| 0 | 5.0  |  231312231 |
| 1 | 3.0  |  192312311 | #Epoch time (seconds since Thursday, 1 January 1970)
-------------------------
to the following dataframe:
--id--rating--timestamp--
--------------------------
| 0 |  5.0  |  05        |
| 1 |  3.0  |  04        | #Month of year
--------------------------
How can I do that?
Spark's withColumn() function on DataFrame is used to update the values of a column. It takes two arguments: the name of the column to update and a Column expression giving the new values. If no column with that name exists, a new column is created with those values.
You can replace column values of a PySpark DataFrame using the SQL string functions regexp_replace(), translate(), and overlay().
You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Because DataFrames are distributed, immutable collections, you cannot change column values in place; whichever approach you use, PySpark returns a new DataFrame with the updated values.
1. Using Spark withColumnRenamed to rename a DataFrame column. Spark has a withColumnRenamed() function on DataFrame to change a column name. This is the most straightforward approach: the function takes two parameters, the existing column name and the new column name.
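To make the difference between the two calls concrete, here is a minimal sketch in Scala. It assumes a local SparkSession and a small in-memory DataFrame shaped like the one in the question; the object name `WithColumnSketch`, the helper names, and the `source` and `epoch_seconds` column names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical helper object, for illustration only.
object WithColumnSketch {
  // "rating" already exists, so it is overwritten in place;
  // "source" does not exist, so it is appended as a new column.
  def updateAndAdd(df: DataFrame): DataFrame =
    df.withColumn("rating", col("rating") * 2)
      .withColumn("source", lit("demo"))

  // Only the column name changes; the values are untouched.
  def rename(df: DataFrame): DataFrame =
    df.withColumnRenamed("timestamp", "epoch_seconds")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("with-column-sketch")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()
    import spark.implicits._

    val df = Seq((0, 5.0, 231312231L), (1, 3.0, 192312311L))
      .toDF("id", "rating", "timestamp")

    updateAndAdd(df).show()
    rename(df).show()
    df.show() // the original DataFrame still has its original columns and values

    spark.stop()
  }
}
```

Both helpers return a new DataFrame, so `df` itself can still be reused afterwards.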
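It's easy using built-in functions:
import org.apache.spark.sql.functions.{from_unixtime, month}
import spark.implicits._  // enables the $"..." column syntax; `spark` is your SparkSession
// from_unixtime turns epoch seconds into a timestamp string; month extracts the month as an integer (1-12)
val newDF = dataset.withColumn("timestamp", month(from_unixtime($"timestamp")))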
Note that DataFrames are immutable, so you can create a new DataFrame but not modify an existing one. Of course, you can assign the new Dataset back to the same variable if it is declared as a var.
Note 2: DataFrame is a type alias for Dataset[Row], which is why I use both names.
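One detail worth spelling out: month() yields an integer such as 5, while the question shows a zero-padded 05. If the padded string matters, one option is to pass a format pattern to from_unixtime. The sketch below assumes a local SparkSession and pins the session time zone to UTC, because the month can differ near month boundaries depending on the time zone. The names `MonthOfYearSketch` and `withPaddedMonth` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_unixtime}

// Hypothetical helper object, for illustration only.
object MonthOfYearSketch {
  // Replaces the epoch-seconds column with a two-digit month string ("01" to "12").
  // from_unixtime interprets the value in the session time zone.
  def withPaddedMonth(df: DataFrame): DataFrame =
    df.withColumn("timestamp", from_unixtime(col("timestamp"), "MM"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("month-of-year-sketch")
      .master("local[*]")                          // assumption: local run, no cluster
      .config("spark.sql.session.timeZone", "UTC") // assumption: epochs are read as UTC
      .getOrCreate()
    import spark.implicits._

    val dataset = Seq((0, 5.0, 231312231L), (1, 3.0, 192312311L))
      .toDF("id", "rating", "timestamp")

    withPaddedMonth(dataset).show()

    spark.stop()
  }
}
```

The resulting column is a string, so keep the integer version from month() if you plan to sort or compare months numerically.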
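If you are coming from Scala, you can use the sql.functions methods inside DataFrame.select or DataFrame.withColumn. For your case, I think the method month(e: Column): Column can perform the change you want. Note that month takes a Column holding a date or timestamp rather than a column name, so the epoch seconds need to be converted with from_unixtime first. It will be something like this:
import org.apache.spark.sql.functions.{col, from_unixtime, month}
// withColumn already names the result, so no extra alias is needed
df.withColumn("timestamp", month(from_unixtime(col("timestamp"))))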
I believe there is an equivalent way in Java, Python and R.