
Is there a function in PySpark similar to Python's re.findall()?

For an assignment I need to extract all "mentions" in a comment. In plain Python I would do something like this:

import re

string = "@rjberger10 @geneh19 home"
re.findall(r'@\w+', string)

This gives me a list like this: ['@rjberger10', '@geneh19']. However, the assignment states that we have to do it the PySpark way, and I can't find a function in PySpark similar to findall(). The closest I got was this:

from pyspark.sql.functions import col, regexp_extract
result = dict_comments[13].withColumn("@tjes", regexp_extract(col("Text"), r'(@\w+)', 0))

However, this only returns the first mention, so when a comment contains several mentions I only find one.

Tibo Geysen asked Sep 05 '25 03:09


1 Answer

I have an approach using explode, split, regexp_extract and collect_list.

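from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# Small example DataFrame: one comment per row.
re_df = spark.createDataFrame([(1, "@rjberger10"),
                               (2, "@geneh19"),
                               (3, "home"),
                               (4, "@geneh19 @rjberger10")],
                              ["id", "test_string"])

temp_column_name = "explode_on_split"

(
    re_df.withColumn(temp_column_name,
                     # split on spaces, then produce one row per token
                     f.explode(f.split(f.col("test_string"), " ")))
         .withColumn('extract',
                     # regexp_extract returns "" when a token has no match
                     f.regexp_extract(f.col(temp_column_name), r'(@\w+)', 1))
         .groupBy(f.col('test_string'))
         .agg(
           # when() without otherwise() turns "" into null, which collect_list skips
           f.collect_list(f.when(f.col('extract') != '', f.col('extract')))
            .alias('final_extract'))
).show()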

+--------------------+--------------------+
|         test_string|       final_extract|
+--------------------+--------------------+
|@geneh19 @rjberger10|[@geneh19, @rjber...|
|         @rjberger10|       [@rjberger10]|
|                home|                  []|
|            @geneh19|          [@geneh19]|
+--------------------+--------------------+

I know this looks cumbersome, so to explain: I split on spaces, explode the resulting array into rows, regex-extract the mention from each row, and collect the results back into a list, in that order. However, if you look at the Spark GitHub page, a built-in function for this may arrive soon, as there is already an open PR (https://github.com/apache/spark/pull/21985). Hope this helps!
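To see what each step contributes, the same split, extract, collect sequence can be mimicked on plain Python strings. This is only a sketch and not Spark code: the helper names extract_first and mentions_via_split are made up for illustration, and it assumes that a non-matching token yields an empty string, which is what regexp_extract returns when the pattern does not match.

```python
import re

MENTION = re.compile(r"(@\w+)")


def extract_first(token):
    """Mimic regexp_extract(..., 1): first match in the token, or "" if none."""
    match = MENTION.search(token)
    return match.group(1) if match else ""


def mentions_via_split(text):
    """Split on spaces, extract from each token, collect the non-empty results."""
    tokens = text.split(" ")                         # plays the role of split
    extracted = [extract_first(t) for t in tokens]   # explode + regexp_extract
    return [m for m in extracted if m]               # collect, dropping "" entries


rows = ["@rjberger10", "@geneh19", "home", "@geneh19 @rjberger10"]
for row in rows:
    print(row, "->", mentions_via_split(row))
```

One difference from re.findall is worth keeping in mind: because the text is split on spaces first, this finds at most one mention per token, so a token such as "@a@b" would yield only "@a".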

Napoleon Borntoparty answered Sep 08 '25 01:09
