I am trying to extract regex patterns from a column using PySpark. I have a DataFrame that contains the regex patterns, and another DataFrame that contains the strings I'd like to match.
from pyspark.sql import SparkSession

# Any SparkSession (or SQLContext) works here; it is referred to as `sql` below.
sql = SparkSession.builder.getOrCreate()

columns = ['id', 'text']
vals = [
    (1, 'here is a Match1'),
    (2, 'Do not match'),
    (3, 'Match2 is another example'),
    (4, 'Do not match'),
    (5, 'here is a Match1')
]
df_to_extract = sql.createDataFrame(vals, columns)
columns = ['id', 'Regex', 'Replacement']
vals = [
    (1, 'Match1', 'Found1'),
    (2, 'Match2', 'Found2'),
]
df_regex = sql.createDataFrame(vals, columns)
I'd like to match each pattern in the 'Regex' column against the 'text' column of 'df_to_extract', and for every row that matches, return the id together with the 'Replacement' that corresponds to the matching 'Regex'. For example:
+---+------------+
| id| replacement|
+---+------------+
| 1| Found1|
| 3| Found2|
| 5| Found1|
+---+------------+
Thanks!
One way is to use pyspark.sql.functions.expr, which allows you to use a column value as a parameter in your join condition.
For example:
from pyspark.sql.functions import expr
df_to_extract.alias("e")\
    .join(
        df_regex.alias("r"),
        on=expr(r"e.text LIKE concat('%', r.Regex, '%')"),
        how="inner"
    )\
    .select("e.id", "r.Replacement")\
    .show()
#+---+-----------+
#| id|Replacement|
#+---+-----------+
#| 1| Found1|
#| 3| Found2|
#| 5| Found1|
#+---+-----------+
Here I used the SQL expression:

e.text LIKE concat('%', r.Regex, '%')

which joins all rows where the text column contains the value of the Regex column, with the % wildcards matching anything before and after it.
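Note that LIKE only does wildcard matching, not true regular-expression matching; it works here because the patterns happen to be plain substrings. If your 'Regex' column holds actual regular expressions, here is a minimal sketch of the same join using RLIKE instead (assuming the same df_to_extract and df_regex as above):

from pyspark.sql.functions import expr

df_to_extract.alias("e")\
    .join(
        df_regex.alias("r"),
        # RLIKE evaluates r.Regex as a Java regular expression and returns
        # true if e.text contains a match, so a pattern like 'Match\\d'
        # would match both Match1 and Match2.
        on=expr("e.text RLIKE r.Regex"),
        how="inner"
    )\
    .select("e.id", "r.Replacement")\
    .show()

Either way the join condition is not an equality, so Spark will typically execute it as a (broadcast) nested loop join; since df_regex is small, wrapping it in pyspark.sql.functions.broadcast(df_regex) makes that explicit.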