There is a function in pyspark:
def sum(a, b):
    c = a + b
    return c
It has to be run on each record of a very large DataFrame using Spark SQL:
x = sum(df.select("NUM1").first()["NUM1"], df.select("NUM2").first()["NUM2"])
But this runs it only for the first record of the df and not for all rows. I understand it could be done using a lambda, but I am not able to code it in the desired way.
In reality, c would be a DataFrame and the function would be doing a lot of spark.sql work and returning it; I would have to call that function for each row. I am using this sum(a, b) function as an analogy.
Input:
+----+----+-----+
|NUM1|NUM2|  XYZ|
+----+----+-----+
|  10|  20|HELLO|
|  90|  60|WORLD|
|  50|  45|SPARK|
+----+----+-----+
Expected output:
+----+----+-----+-----+
|NUM1|NUM2|  XYZ|VALUE|
+----+----+-----+-----+
|  10|  20|HELLO|   30|
|  90|  60|WORLD|  150|
|  50|  45|SPARK|   95|
+----+----+-----+-----+
Python: 3.7.4
Spark: 2.2
You can use the .withColumn function with a UDF:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType
df.show()
+----+----+-----+
|NUM1|NUM2| XYZ|
+----+----+-----+
| 10| 20|HELLO|
| 90| 60|WORLD|
| 50| 45|SPARK|
+----+----+-----+
def mysum(a, b):
    return a + b

# Wrap the plain Python function as a UDF so it can be applied to columns
mysum_udf = udf(mysum, LongType())
# Registering it is only needed if you also want to call it from SQL expressions
spark.udf.register("mysumudf", mysum, LongType())
df2 = df.withColumn("VALUE", mysum_udf(col("NUM1"), col("NUM2")))
df2.show()
+----+----+-----+-----+
|NUM1|NUM2| XYZ|VALUE|
+----+----+-----+-----+
| 10| 20|HELLO| 30|
| 90| 60|WORLD| 150|
| 50| 45|SPARK| 95|
+----+----+-----+-----+
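Because mysum was also registered as mysumudf, it can be called from a Spark SQL expression as well. A minimal sketch, assuming df is first exposed as a temporary view (the view name my_table is just an example):
df.createOrReplaceTempView("my_table")
df3 = spark.sql("SELECT NUM1, NUM2, XYZ, mysumudf(NUM1, NUM2) AS VALUE FROM my_table")
df3.show()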
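For a simple addition like this, a UDF is not strictly necessary; the same VALUE column can be produced with a native column expression, which Spark can optimize better than a Python UDF:
from pyspark.sql.functions import col
df4 = df.withColumn("VALUE", col("NUM1") + col("NUM2"))
df4.show()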