I'm currently studying a text corpus. 
Let's say that I cleaned my verbatims and that I have the following pyspark DataFrame :
df = spark.createDataFrame([(0, ["a", "b", "c"]),
                            (1, ["a", "b", "b", "c", "a"])],
                            ["label", "raw"])
df.show()
+-----+---------------+
|label|            raw|
+-----+---------------+
|    0|      [a, b, c]|
|    1|[a, b, b, c, a]|
+-----+---------------+
I now want to implement a CountVectorizer. So, I used pyspark.ml.feature.CountVectorizer as follows :
cv = CountVectorizer(inputCol="raw", outputCol="vectors")
model = cv.fit(df)
model.transform(df).show(truncate=False)
+-----+---------------+-------------------------+
|label|raw            |vectors                  |
+-----+---------------+-------------------------+
|0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+
Now, I would also like to get the vocabulary that the CountVectorizer selected, as well as the corresponding word frequencies in the corpus. 
 Using cvmodel.vocabulary only provides the vocabulary list :
voc = cvmodel.vocabulary
voc
[u'b', u'a', u'c']
I would like to get something like that :
voc = {u'a':3,u'b':3,u'c':2}
Would you have any idea to do such a things?
Edit:
 I am using Spark 2.1
Calling cv.fit() returns a CountVectorizerModel, which (AFAIK) stores the vocabulary but it does not store the counts. The vocabulary is property of the model (it needs to know what words to count), but the counts are a property of the DataFrame (not the model). You can apply the transform function of the fitted model to get the counts for any DataFrame.
That being said, here are two ways to get the output you desire.
1. Using Existing Count Vectorizer Model
You can use pyspark.sql.functions.explode() and pyspark.sql.functions.collect_list() to gather the entire corpus into a single row. For illustrative purposes, let's consider a new DataFrame df2 which contains some words unseen by the fitted CountVectorizer:
import pyspark.sql.functions as f
df2 = sqlCtx.createDataFrame([(0, ["a", "b", "c", "x", "y"]),
                            (1, ["a", "b", "b", "c", "a"])],
                            ["label", "raw"])
combined_df = (
    df2.select(f.explode('raw').alias('col'))
      .select(f.collect_list('col').alias('raw'))
)
combined_df.show(truncate=False)
#+------------------------------+
#|raw                           |
#+------------------------------+
#|[a, b, c, x, y, a, b, b, c, a]|
#+------------------------------+
Then use the fitted model to transform this into counts and collect the results:
counts = model.transform(combined_df).select('vectors').collect()
print(counts)
#[Row(vectors=SparseVector(3, {0: 3.0, 1: 3.0, 2: 2.0}))]
Next zip the counts and the vocabulary together and use the dict constructor to get the desired output:
print(dict(zip(model.vocabulary, counts[0]['vectors'].values)))
#{u'a': 3.0, u'b': 3.0, u'c': 2.0}
As you correctly pointed out in the comments, this will only consider the words that are part of the CountVectorizerModel's vocabulary. Any other words will be ignored. Hence we don't see any entries for "x" or "y".
2. Use DataFrame aggregate functions
Or you can skip the CountVectorizer and get your output using a groupBy(). This is a more generic solution in that it will give the counts for all words in the DataFrame, not just those in the vocabulary:
counts = df2.select(f.explode('raw').alias('col')).groupBy('col').count().collect()
print(counts)
#[Row(col=u'x', count=1), Row(col=u'y', count=1), Row(col=u'c', count=2), 
# Row(col=u'b', count=3), Row(col=u'a', count=3)]
Now simply use a dict comprehension:
print({row['col']: row['count'] for row in counts})
#{u'a': 3, u'b': 3, u'c': 2, u'x': 1, u'y': 1}
Here we have the counts for "x" and "y" as well.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With