I'm a newbie in PySpark.
I have a Spark DataFrame df that has a column 'device_type'. 
I want to replace every value that is in "Tablet" or "Phone" to "Phone", and replace "PC" to "Desktop".
In Python I can do the following,
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df['device_type'] = df['device_type'].replace(deviceDict,inplace=False)
How can I achieve this using PySpark? Thanks!
You can replace column values of PySpark DataFrame by using SQL string functions regexp_replace(), translate(), and overlay() with Python examples.
As I said in the beginning, PySpark doesn't have a Dictionary type instead it uses MapType to store the dictionary object, below is an example of how to create a DataFrame column MapType using pyspark. sql. types. StructType .
Collect() is the function, operation for RDD or Dataframe that is used to retrieve the data from the Dataframe. It is used useful in retrieving all the elements of the row from each partition in an RDD and brings that over the driver node/program.
Update the column value Spark withColumn() function of the DataFrame is used to update the value of a column. withColumn() function takes 2 arguments; first the column you wanted to update and the second the value you wanted to update with.
You can use either na.replace:
df = spark.createDataFrame([
    ('Tablet', ), ('Phone', ),  ('PC', ), ('Other', ), (None, )
], ["device_type"])
df.na.replace(deviceDict, 1).show()
+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+
or map literal:
from itertools import chain
from pyspark.sql.functions import create_map, lit
mapping = create_map([lit(x) for x in chain(*deviceDict.items())])
df.select(mapping[df['device_type']].alias('device_type'))
+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|       null|
|       null|
+-----------+
Please note that the latter solution will convert values not present in the mapping to NULL. If this is not a desired behavior you can add coalesce:
from pyspark.sql.functions import coalesce
df.select(
    coalesce(mapping[df['device_type']], df['device_type']).alias('device_type')
)
+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With