For an assignment I need to extract all "mentions" in a comment. In plain Python I would do something like this:
string = "@rjberger10 @geneh19 home"
re.findall(r'@\w+', string)
This would give me a list like this: ['@rjberger10', '@geneh19']. However, the assignment states that we have to do it the PySpark way, and I can't find a function in PySpark similar to findall(). The closest I got was this:
result = dict_comments[13].withColumn("@tjes", regexp_extract(col("Text"), r'(@\w+)', 0))
However, this only returns the first match, so when a comment contains multiple mentions I only find one.
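For comparison, the same first-match-only behaviour shows up in plain Python with re.search, while re.findall returns every match (a small illustration, not part of the assignment):

```python
import re

string = "@rjberger10 @geneh19 home"

# re.search stops at the first match, like Spark's regexp_extract
first = re.search(r'@\w+', string).group(0)

# re.findall keeps going and returns all matches
all_mentions = re.findall(r'@\w+', string)
```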
I have an approach using explode, split, regexp_extract and collect_list.
import pyspark.sql.functions as f

# assumes an active SparkSession named `spark`
re_df = spark.sparkContext.parallelize([[1, "@rjberger10"],
                                        [2, "@geneh19"],
                                        [3, "home"],
                                        [4, "@geneh19 @rjberger10"]]).toDF(["id", "test_string"])
temp_column_name = "explode_on_split"
(
re_df.withColumn(temp_column_name,
f.explode(f.split(f.col("test_string"), " ")))
.withColumn('extract',
f.regexp_extract(f.col(temp_column_name), r'(@\w+)', 1))
.groupBy(f.col('test_string'))
.agg(
f.collect_list(f.col('extract')).alias('final_extract'))
).show()
+--------------------+--------------------+
| test_string| final_extract|
+--------------------+--------------------+
|@geneh19 @rjberger10|[@geneh19, @rjber...|
| @rjberger10| [@rjberger10]|
| home| []|
| @geneh19| [@geneh19]|
+--------------------+--------------------+
I know this looks cumbersome, so to explain: I split on spaces, explode the result into one row per token, regex-extract each token, and finally collect the extracts back into a list. However, if you look at the Spark GitHub repository, you can expect a built-in function for this to arrive soon, as there is already an open PR (https://github.com/apache/spark/pull/21985). Hope this helps!