Wondering if there is a built-in Spark feature to combine 1-, 2-, n-gram features into a single vocabulary. Setting n=2 in NGram followed by invocation of CountVectorizer results in a dictionary containing only 2-grams. What I really want is to combine all frequent 1-grams, 2-grams, etc into one dictionary for my corpus.
You can train separate NGram and CountVectorizer models and merge using VectorAssembler.
from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
from pyspark.ml import Pipeline
def build_ngrams(inputCol="tokens", n=3):
ngrams = [
NGram(n=i, inputCol="tokens", outputCol="{0}_grams".format(i))
for i in range(1, n + 1)
]
vectorizers = [
CountVectorizer(inputCol="{0}_grams".format(i),
outputCol="{0}_counts".format(i))
for i in range(1, n + 1)
]
assembler = [VectorAssembler(
inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
outputCol="features"
)]
return Pipeline(stages=ngrams + vectorizers + assembler)
Example usage:
df = spark.createDataFrame([
(1, ["a", "b", "c", "d"]),
(2, ["d", "e", "d"])
], ("id", "tokens"))
build_ngrams().fit(df).transform(df)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With