There is a function in pyspark:
def sum(a, b):
    c = a + b
    return c
It has to be run on each record of a very large DataFrame using Spark SQL:
x = sum(df.select("NUM1").first()["NUM1"], df.select("NUM2").first()["NUM2"])
But this runs it only for the first record of the df and not for all rows. I understand it could be done using a lambda, but I am not able to code it in the desired way.
In reality, c would be a DataFrame and the function would be doing a lot of spark.sql work and returning it; I would have to call that function for each row. I am using this sum(a, b) function as an analogy.
Input:
+----+----+-----+
|NUM1|NUM2|  XYZ|
+----+----+-----+
|  10|  20|HELLO|
|  90|  60|WORLD|
|  50|  45|SPARK|
+----+----+-----+
Expected output:
+----+----+-----+-----+
|NUM1|NUM2|  XYZ|VALUE|
+----+----+-----+-----+
|  10|  20|HELLO|   30|
|  90|  60|WORLD|  150|
|  50|  45|SPARK|   95|
+----+----+-----+-----+
Python: 3.7.4
Spark: 2.2
You can use the .withColumn function with a UDF:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType
df.show()
+----+----+-----+
|NUM1|NUM2| XYZ|
+----+----+-----+
| 10| 20|HELLO|
| 90| 60|WORLD|
| 50| 45|SPARK|
+----+----+-----+
def mysum(a, b):
    return a + b

# Wrap the plain Python function as a UDF so it can be applied to columns
mysum_udf = udf(mysum, LongType())
# Registering it is only needed if you also want to call it from SQL expressions
spark.udf.register("mysumudf", mysum, LongType())
df2 = df.withColumn("VALUE", mysum_udf(col("NUM1"), col("NUM2")))
df2.show()
+----+----+-----+-----+
|NUM1|NUM2| XYZ|VALUE|
+----+----+-----+-----+
| 10| 20|HELLO| 30|
| 90| 60|WORLD| 150|
| 50| 45|SPARK| 95|
+----+----+-----+-----+
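Because mysum was also registered as mysumudf, it can be called from a Spark SQL expression as well. A minimal sketch, assuming df is first exposed as a temporary view (the view name my_table is just an example):
df.createOrReplaceTempView("my_table")
df3 = spark.sql("SELECT NUM1, NUM2, XYZ, mysumudf(NUM1, NUM2) AS VALUE FROM my_table")
df3.show()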
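For a simple addition like this, a UDF is not strictly necessary; the same VALUE column can be produced with a native column expression, which Spark can optimize better than a Python UDF:
from pyspark.sql.functions import col
df4 = df.withColumn("VALUE", col("NUM1") + col("NUM2"))
df4.show()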