PySpark: size function on elements of a vector from CountVectorizer?

Background: I have URL data aggregated into a string array, of this form: [xyz.com, abc.com, efg.com]

1) I filter based on the URL count in a row with:

from pyspark.sql.functions import size

vectored_file.filter(size('agg_url_host') > 3)

2) I filter out URLs that do not occur frequently in the next step with:

CountVectorizer(inputCol="agg_url_host", outputCol="vectors", minDF=10000)

The problem is that some rows have enough URLs to pass my size filter in step 1, but no longer do once the less frequent URLs are removed. So I end up with rows whose vectors column reads (68,[],[]) or (68,[4,56],[1.0,1.0]), even though I only want rows with counts higher than 3 for modeling.

So my question is: can I run a size function on a vector object like the output of CountVectorizer? Or is there a similar function that will remove rows with low counts?

Perhaps there is a way to create a new string array column from my original 'agg_url' column with the less frequent URLs removed? Then I could run CountVectorizer on that.
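Roughly what I have in mind, as a sketch (assuming the DataFrame is called vectored_file as in step 1, and keep_frequent is a helper I made up): fit CountVectorizer once, use its learned vocabulary to prune the array column, then re-apply the size filter before transforming:

from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import udf, size
from pyspark.sql.types import ArrayType, StringType

# Fit once so the model learns which URLs clear the minDF threshold
cv = CountVectorizer(inputCol="agg_url_host", outputCol="vectors", minDF=10000)
model = cv.fit(vectored_file)
frequent = set(model.vocabulary)

# Drop infrequent URLs from each array (duplicates are preserved),
# then re-apply the size filter before transforming
@udf(ArrayType(StringType()))
def keep_frequent(urls):
    return [u for u in urls if u in frequent]

pruned = (vectored_file
          .withColumn("agg_url_host", keep_frequent("agg_url_host"))
          .filter(size("agg_url_host") > 3))
result = model.transform(pruned)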

Any help appreciated.

JB5 asked Sep 18 '25

1 Answer

The size of the output vector is always fixed, so the only thing you can do is count the non-zero elements:

from pyspark.ml.linalg import SparseVector
from pyspark.sql.functions import udf

# Count the non-zero entries of an ML vector (works for SparseVector and DenseVector)
@udf("long")
def num_nonzeros(v):
    return v.numNonzeros()

df = spark.createDataFrame([
    (1, SparseVector(10, [1, 2, 4, 6], [0.1, 0.3, 0.1, 0.1])),
    (2, SparseVector(10, [], []))
], ("id", "vectors"))

# Keep only rows whose vector has more than 3 non-zero entries
df.where(num_nonzeros("vectors") > 3).show()
# +---+--------------------+      
# | id|             vectors|
# +---+--------------------+
# |  1|(10,[1,2,4,6],[0....|
# +---+--------------------+

But an operation like this is not a very useful feature engineering step in general. Remember that a lack of information is information as well.
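As a side note, if you are on Spark 3.0 or later (an assumption, since the question does not state a version), you can avoid the Python UDF entirely by converting the vector to a plain array with vector_to_array and counting non-zero entries with a SQL higher-order function:

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import expr

# Convert the ML vector column to an array column, then count non-zeros
df_arr = df.withColumn("arr", vector_to_array("vectors"))
df_arr.where(expr("size(filter(arr, x -> x != 0)) > 3")).show()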

Alper t. Turker answered Sep 20 '25