I have a Spark DataFrame, results, with two string columns that I would like to cast to a numeric type:
>>> results.show()
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             "43"|                    "20"|
|"BAYLOR MEDICAL C...|             "32"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"MASONIC HOME AND...|  "Not Available"|         "Not Available"|
|"ST HELENA HOSPITAL"|             "41"|                    "20"|
|   "TOURO INFIRMARY"|             "15"|                    "18"|
|"WAHIAWA GENERAL ...|             "17"|                    "10"|
|"ANNA JAQUES HOSP...|             "27"|                    "18"|
|    "CMC-BLUE RIDGE"|             "31"|                    "18"|
|"EVANSTON REGIONA...|             "15"|                    "15"|
|"OKLAHOMA SPINE H...|             "79"|                    "20"|
|"PICKENS COUNTY M...|  "Not Available"|         "Not Available"|
|"PORTNEUF MEDICAL...|             "11"|                    "17"|
|"PRESENCE SAINT J...|             "20"|                    "17"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"SOUTH GEORGIA ME...|    "3 out of 10"|                    "24"|
|"TAMPA GENERAL HO...|             "23"|                    "16"|
+--------------------+-----------------+------------------------+
Attempting the cast gives me a table of null values:
>>> results2 = results.select( results["Hospital Name"], results["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score") )
>>> results2.show()
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             null|                    null|
|"BAYLOR MEDICAL C...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"MASONIC HOME AND...|             null|                    null|
|"ST HELENA HOSPITAL"|             null|                    null|
|   "TOURO INFIRMARY"|             null|                    null|
|"WAHIAWA GENERAL ...|             null|                    null|
|"ANNA JAQUES HOSP...|             null|                    null|
|    "CMC-BLUE RIDGE"|             null|                    null|
|"EVANSTON REGIONA...|             null|                    null|
|"OKLAHOMA SPINE H...|             null|                    null|
|"PICKENS COUNTY M...|             null|                    null|
|"PORTNEUF MEDICAL...|             null|                    null|
|"PRESENCE SAINT J...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"SOUTH GEORGIA ME...|             null|                    null|
|"TAMPA GENERAL HO...|             null|                    null|
+--------------------+-----------------+------------------------+
only showing top 20 rows
Is it not possible to cast string columns to integer in PySpark?
First, strip the double quotes. The values contain literal double-quote characters, so they are not valid integers and the cast returns null. Once the quotes are removed, the cast to IntegerType works. You can use the UDF below to do this.
>>> def stripDQ(string):
...  return string.replace('"', "")
... 
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType, IntegerType
>>> udf_stripDQ = udf(stripDQ, StringType())
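One caveat with stripDQ as written: when a column contains nulls, Spark passes None to the Python function, and calling .replace on None raises an AttributeError. A minimal null-safe sketch follows; the name strip_dq is hypothetical and not part of the original answer.

```python
def strip_dq(value):
    """Remove double-quote characters from a string.

    None is passed through unchanged, so null cells in the
    DataFrame stay null instead of raising AttributeError.
    """
    if value is None:
        return None
    return value.replace('"', "")


# Plain-Python check of the behaviour on values like those in the table.
print(strip_dq('"43"'))             # quoted number
print(strip_dq('"Not Available"'))  # quoted text
print(strip_dq(None))               # null cell
```

It can be registered the same way as stripDQ, for example with udf(strip_dq, StringType()).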
We will apply udf_stripDQ below.
Your original DataFrame:
>>> results.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|             "43"|                    "20"|
|"BAYLOR MEDICAL C"|             "32"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"MASONIC HOME AND"|  "Not Available"|         "Not Available"|
+------------------+-----------------+------------------------+
Now use the UDF to strip the double quotes from both score columns.
>>> results1 = results.withColumn("HCAHPS Base Score", udf_stripDQ(results["HCAHPS Base Score"]) ).withColumn("HCAHPS Consistency Score", udf_stripDQ(results["HCAHPS Consistency Score"]) )
>>> results1.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|               43|                      20|
|"BAYLOR MEDICAL C"|               32|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"MASONIC HOME AND"|    Not Available|           Not Available|
+------------------+-----------------+------------------------+
Now cast to integer:
>>> results2 = results1.select( results1["Hospital Name"], results1["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results1["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score") )
>>> results2.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|               43|                      20|
|"BAYLOR MEDICAL C"|               32|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"MASONIC HOME AND"|             null|                    null|
+------------------+-----------------+------------------------+