I have a dataframe with multiple categorical columns. I'm trying to compute the chi-squared statistic between two columns using the built-in function:
from pyspark.ml.stat import ChiSquareTest
r = ChiSquareTest.test(df, 'feature1', 'feature2')
However, it gives me the error:
IllegalArgumentException: 'requirement failed: Column feature1 must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'
The datatype for feature1 is:
feature1: double (nullable = true)
Could you please help me with this?
spark-ml is not a typical statistics library; it is very ML-oriented, so it assumes you want to run a test between a label and a feature (or a group of features).
Therefore, just as when you train a model, you need to assemble the features you want to test against the label.
In your case, you can just assemble feature1 as follows:
from pyspark.ml.stat import ChiSquareTest
from pyspark.ml.feature import VectorAssembler
data = [(1, 2), (3, 4), (2, 1), (4, 3)]
df = spark.createDataFrame(data, ['feature1', 'feature2'])
assembler = VectorAssembler().setInputCols(['feature1']).setOutputCol('features')
ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').show(truncate=False)
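If it helps, here is a minimal sketch of how you might read the result back out. The returned DataFrame has a single row whose pValues, degreesOfFreedom and statistics columns hold the test outcome, one entry per assembled feature:

r = ChiSquareTest.test(assembler.transform(df), 'features', 'feature2')
row = r.head()
print(row.pValues)            # p-value per assembled feature
print(row.degreesOfFreedom)   # degrees of freedom per assembled feature
print(row.statistics)         # chi-squared statistic per assembled feature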
Just in case, here is the code in Scala:
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._ // needed for toDF outside the spark-shell
val df = Seq((1, 2, 3), (1, 2, 3), (4, 5, 6), (6, 5, 4))
.toDF("feature1", "feature2", "feature3")
val assembler = new VectorAssembler()
.setInputCols(Array("feature1"))
.setOutputCol("features")
ChiSquareTest.test(assembler.transform(df), "features", "feature2").show(false)
To expand on Oli's answer, Spark ML expects features to be stored in instances of pyspark.ml.linalg.Vector. There are two kinds of vectors:
- dense vectors, which are essentially an array<double> holding all of the element values
- sparse vectors, which consist of a size that indicates the full dimension of the vector, an indices array that holds the positions of the non-zero elements, and a values array that holds the values of the non-zero elements

Both vector types are actually represented using the structure for sparse vectors, whereas for dense vectors the indices array goes unused and values stores all of the values. The first structure element, type, is used to distinguish between the two kinds.
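To make the two representations concrete, here is a small sketch constructing the same vector both ways:

from pyspark.ml.linalg import Vectors

dense = Vectors.dense([1.0, 0.0, 3.0])          # stores every element explicitly
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])  # size, indices of non-zeros, their values
# Both represent the same mathematical vector
print(dense)   # [1.0,0.0,3.0]
print(sparse)  # (3,[0,2],[1.0,3.0])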
So, if you see an error saying that something expects struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, it means you are supposed to pass an instance of pyspark.ml.linalg.Vector and not mere numbers.
In order to produce Vectors, you can either use pyspark.ml.feature.VectorAssembler to assemble one or more independent feature columns into a single vector column, or construct them manually using the factory methods Vectors.dense() (for dense vectors) and Vectors.sparse() (for sparse vectors) of the factory object pyspark.ml.linalg.Vectors. Using VectorAssembler is probably easier and also faster, since it is implemented in Scala. For explicit vector creation, consult the ChiSquareTest example in the PySpark documentation.
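As a sketch of the manual route (assuming an active SparkSession named spark, as in the earlier snippet), you can pre-pack the feature column as Vectors yourself and skip the assembler entirely:

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

# Toy data: features already packed as Vectors, label as a plain number
data = [(Vectors.dense([1.0]), 2.0),
        (Vectors.dense([3.0]), 4.0),
        (Vectors.dense([2.0]), 1.0),
        (Vectors.dense([4.0]), 3.0)]
df = spark.createDataFrame(data, ['features', 'label'])
ChiSquareTest.test(df, 'features', 'label').show(truncate=False)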