Given the below data frame, i wanted to split the numbers column into an array of 3 characters per element of the original number in the array
Given data frame :
+---+------------------+
| id| numbers|
+---+------------------+
|742| 000000000|
|744| 000000|
|746|003000000000000000|
+---+------------------+
Expected dataframe :
+---+----------------------------------+
| id| numbers |
+---+----------------------------------+
|742| [000, 000, 000] |
|744| [000, 000] |
|746| [003, 000, 000, 000, 000, 000] |
+---+----------------------------------+
I tried different regular expressions while using the split
function given below the with the regex that I felt should have worked on the very first try:
import pyspark.sql.functions as f
df = spark.createDataFrame(
[
[742, '000000000'],
[744, '000000'],
[746, '003000000000000000'],
],
["id", "numbers"]
)
df = df.withColumn("numbers", f.split("numbers", "[0-9]{3}"))
df.show()
The result however is
+---+--------------+
| id| numbers|
+---+--------------+
|742| [, , , ]|
|744| [, , ]|
|746|[, , , , , , ]|
+---+--------------+
I want to understand what I am doing wrong. Is there a possibility of setting the global flag for getting all the matches or have I missed something in the regular expression altogether?
Here's how you can do this without using a udf
:
df = df.withColumn(
"numbers",
f.split(f.regexp_replace("numbers", "([0-9]{3})(?!$)", r"$1,"), ",")
)
df.show(truncate=False)
#+---+------------------------------+
#|id |numbers |
#+---+------------------------------+
#|742|[000, 000, 000] |
#|744|[000, 000] |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+
First use pyspark.sql.functions.regexp_replace
to replace sequences of 3 digits with the sequence followed by a comma. Then split the resulting string on a comma.
The replacement pattern "$1,"
means first capturing group, followed by a comma.
In the match pattern, we also include a negative lookahead for end of string, (?!$)
, to avoid adding a comma to the end of the string.
Reference: REGEXP_REPLACE capturing groups
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With