
How to get the Integer value of a column in .withColumn function? [Spark - Scala]

I need to use the date_add() function to add 90 days to a DataFrame column. It works correctly when I hardcode the 90, but when the number of days lives in another column and I pass that column instead, the function complains that it expects an Int.

This code works:

.withColumn("DATE_SUM_COLUMN",date_add(col("DATE_COLUMN"),90))

This code does not:

.withColumn("DATE_SUM_COLUMN",date_add(col("DATE_COLUMN"),col("number")))

Thanks.

Sorul asked Nov 16 '25 20:11

2 Answers

You can still use expr to evaluate a Spark SQL expression string, e.g. expr("date_add(date_column, days_to_add)"):

import java.sql.Date

import com.holdenkarau.spark.testing.{DataFrameSuiteBase, SharedSparkContext}
import org.scalatest.FlatSpec
import org.apache.spark.sql.functions.expr

class TestSo2 extends FlatSpec with SharedSparkContext with DataFrameSuiteBase {
  "date_add" should "add a number of days specified as a Column" in {
    import spark.implicits._
    val df = Seq(
      (Date.valueOf("2019-01-01"), 31),
      (Date.valueOf("2019-01-01"), 32)
    ).toDF("date_column", "days_to_add")
    df.show()

    /**
     * +-----------+-----------+
     * |date_column|days_to_add|
     * +-----------+-----------+
     * | 2019-01-01|         31|
     * | 2019-01-01|         32|
     * +-----------+-----------+
     */

    df.
      withColumn(
        "next_date",
        expr("date_add(date_column, days_to_add)")
      ).
      show

    /**
     * +-----------+-----------+----------+
     * |date_column|days_to_add| next_date|
     * +-----------+-----------+----------+
     * | 2019-01-01|         31|2019-02-01|
     * | 2019-01-01|         32|2019-02-02|
     * +-----------+-----------+----------+
     */
  }
}

I don't know why the Spark developers didn't expose this overload in the Scala API at the time. (Newer versions do: as of Spark 3.0 there is a date_add(start: Column, days: Column) overload, so date_add(col("DATE_COLUMN"), col("number")) works directly.)

Andrei Luksha answered Nov 18 '25 09:11


Please try this. Here I am converting the date to seconds, converting the days column to seconds, and summing the two columns; then the result has to be converted back to date format. In my example, date is the date column and add is the number of days to add:

import org.apache.spark.sql.functions._

.withColumn("new_col", to_date(from_unixtime(unix_timestamp($"date", "yyyy-MM-dd") + col("add") * 24 * 60 * 60)))
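As a sanity check on the seconds arithmetic, here is a minimal plain-Scala sketch (no Spark, just java.time; the helper name addDaysViaSeconds is made up for illustration) that does the same thing: convert a date to UTC epoch seconds, add days * 24 * 60 * 60, and convert back:

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

// Hypothetical helper mirroring the unix_timestamp approach:
// date -> epoch seconds, plus days * 86400, back to a date string.
// Using UTC throughout keeps every day exactly 86400 seconds long.
def addDaysViaSeconds(date: String, days: Int): String = {
  val epochSeconds = LocalDate.parse(date).atStartOfDay(ZoneOffset.UTC).toEpochSecond
  val shifted = epochSeconds + days.toLong * 24 * 60 * 60
  Instant.ofEpochSecond(shifted).atZone(ZoneOffset.UTC).toLocalDate.toString
}

println(addDaysViaSeconds("2019-01-01", 31)) // 2019-02-01
println(addDaysViaSeconds("2019-01-01", 90)) // 2019-04-01
```

It also shows why the days column must be scaled to seconds before being added to the unix_timestamp output, and why the sum still needs a final conversion back to a date.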

Ravi answered Nov 18 '25 08:11