
Split string in a spark dataframe column by regular expressions capturing groups

Given the data frame below, I want to split the numbers column into an array whose elements are consecutive 3-character chunks of the original number.

Given data frame:

+---+------------------+
| id|           numbers|
+---+------------------+
|742|         000000000|
|744|            000000|
|746|003000000000000000|
+---+------------------+

Expected data frame:

+---+----------------------------------+
| id|           numbers                |
+---+----------------------------------+
|742| [000, 000, 000]                  |
|744| [000, 000]                       |
|746| [003, 000, 000, 000, 000, 000]   |
+---+----------------------------------+

I tried several regular expressions with the split function. Below is my attempt, using the regex that I felt should have worked on the very first try:

import pyspark.sql.functions as f

df = spark.createDataFrame(
    [
        [742, '000000000'], 
        [744, '000000'], 
        [746, '003000000000000000'], 
    ],
    ["id", "numbers"]
)

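# split the 'numbers' string on every match of the regex (matches act as delimiters)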
df = df.withColumn("numbers", f.split("numbers", "[0-9]{3}"))

df.show()

The result, however, is:

+---+--------------+
| id|       numbers|
+---+--------------+
|742|      [, , , ]|
|744|        [, , ]|
|746|[, , , , , , ]|
+---+--------------+

I want to understand what I am doing wrong. Is there a way to set a global flag to return all the matches, or have I missed something in the regular expression altogether?

asked by Shadab Shariff


1 Answer

Your regex is fine; the problem is that split treats every match of the pattern as a delimiter, so each 3-digit group is consumed and only the empty strings between matches remain. Here's how you can get the groups themselves without using a UDF:

df = df.withColumn(
    "numbers",
    f.split(f.regexp_replace("numbers", "([0-9]{3})(?!$)", r"$1,"), ",")
)

df.show(truncate=False)
#+---+------------------------------+
#|id |numbers                       |
#+---+------------------------------+
#|742|[000, 000, 000]               |
#|744|[000, 000]                    |
#|746|[003, 000, 000, 000, 000, 000]|
#+---+------------------------------+

First use pyspark.sql.functions.regexp_replace to replace sequences of 3 digits with the sequence followed by a comma. Then split the resulting string on a comma.
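
To make the two steps concrete, here is the intermediate string that regexp_replace produces before the split. Note this runs against the original string column, so it should be done before the withColumn above reassigns numbers (a sketch against the sample data):

# Intermediate result: every 3-digit group except the last gets a trailing comma
df.select(
    "id",
    f.regexp_replace("numbers", "([0-9]{3})(?!$)", r"$1,").alias("with_commas")
).show(truncate=False)
#+---+-----------------------+
#|id |with_commas            |
#+---+-----------------------+
#|742|000,000,000            |
#|744|000,000                |
#|746|003,000,000,000,000,000|
#+---+-----------------------+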

The replacement pattern "$1," means the first capturing group, followed by a comma.

In the match pattern, we also include a negative lookahead for end of string, (?!$), to avoid adding a comma to the end of the string.
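
For comparison, here is what happens without the lookahead: the final group also gets a comma, so split leaves a trailing empty string in every array (a sketch against the same sample data):

# Without (?!$), a comma is appended after the final group too,
# and the split produces a trailing empty element.
df.select(
    "id",
    f.split(f.regexp_replace("numbers", "([0-9]{3})", r"$1,"), ",").alias("numbers")
).show(truncate=False)
# 742 -> [000, 000, 000, ]                 <- note the trailing empty string
# 744 -> [000, 000, ]
# 746 -> [003, 000, 000, 000, 000, 000, ]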

Reference: REGEXP_REPLACE capturing groups
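
On the "global flag" part of the question: split has no such flag, but newer Spark versions can return all matches directly. Assuming Spark 3.1 or later (a version assumption, not something this answer requires), the SQL function regexp_extract_all does the whole job in one step:

# Assumes Spark >= 3.1, where regexp_extract_all is available via SQL expressions;
# idx 0 returns the full match for every occurrence of the pattern.
df = df.withColumn("numbers", f.expr("regexp_extract_all(numbers, '[0-9]{3}', 0)"))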

answered by pault