What is the correct PySpark schema for the following list of tuples? I want to apply the schema to the following data:
[('a', 0.0), ('b', 6), ('c', 44), ('d', 107), ('e', 0), ('f', 3), ('g', 4), ('h', 0.025599999353289604), ('i', 0.03239999711513519), ('j', -0.03205680847167969), ('k', 0.10429033637046814), ('l', (34.190006256103516, 31.09000015258789, 31.099994659423828)), ('m', (-9.32000732421875, -9.32000732421875, -11.610000610351562)) ]
I want the result in the following format: [image: Format]
Tanjin's answer should work, but I would like to suggest another approach. Instead of working out how many columns you need to add to your schema, you can use a single column of type array/list. The code below transforms your data into an RDD that, instead of tuples, contains rows of [key, value], where value is a list of doubles. Then you can easily apply the schema below.
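from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

# In the pyspark shell `spark` already exists; getOrCreate() reuses it if so.
spark = SparkSession.builder.getOrCreate()

def test():
    l = [('a', 0.0),
         ('b', 6),
         ('c', 44),
         ('d', 107),
         ('e', 0),
         ('f', 3),
         ('g', 4),
         ('h', 0.025599999353289604),
         ('i', 0.03239999711513519),
         ('j', -0.03205680847167969),
         ('k', 0.10429033637046814),
         ('l', (34.190006256103516, 31.09000015258789, 31.099994659423828)),
         ('m', (-9.32000732421875, -9.32000732421875, -11.610000610351562))]

    # this schema should work for all your cases
    schema = StructType([
        StructField("id", StringType(), False),
        StructField("num_list", ArrayType(DoubleType(), True), True)
    ])

    # normalize every value to a list of floats so it matches ArrayType(DoubleType())
    rdd = spark.sparkContext.parallelize(l).map(lambda r: (r[0], to_float_list(r[1])))
    df = spark.createDataFrame(rdd, schema)
    df.show(100, False)

def to_float_list(value):
    # a tuple becomes a list of floats; a single number becomes a one-element list
    if isinstance(value, tuple):
        return list(map(float, value))
    return [float(value)]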
Notice that the to_float_list function accepts either a tuple or a number and converts it to a list of doubles. This will output:
+---+-----------------------------------------------------------+
|id |num_list                                                   |
+---+-----------------------------------------------------------+
|a  |[0.0]                                                      |
|b  |[6.0]                                                      |
|c  |[44.0]                                                     |
|d  |[107.0]                                                    |
|e  |[0.0]                                                      |
|f  |[3.0]                                                      |
|g  |[4.0]                                                      |
|h  |[0.025599999353289604]                                     |
|i  |[0.03239999711513519]                                      |
|j  |[-0.03205680847167969]                                     |
|k  |[0.10429033637046814]                                      |
|l  |[34.190006256103516, 31.09000015258789, 31.099994659423828]|
|m  |[-9.32000732421875, -9.32000732421875, -11.610000610351562]|
+---+-----------------------------------------------------------+
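If you want to check the normalization step without starting a Spark session, a minimal pure-Python sketch like the one below may help. It uses a three-row subset of the question's data (the subset is my choice, for brevity) and re-declares to_float_list so the snippet is self-contained. It shows that the raw values mix int, float and tuple, which is why they do not fit a single column type directly, and that after the conversion every value is a list of floats, which is the shape ArrayType(DoubleType()) expects.

```python
# Subset of the question's data, chosen only to cover the three value types.
data = [
    ("a", 0.0),
    ("b", 6),
    ("l", (34.190006256103516, 31.09000015258789, 31.099994659423828)),
]

def to_float_list(value):
    """Return a list of floats for a single number or a tuple of numbers."""
    if isinstance(value, tuple):
        return [float(v) for v in value]
    return [float(value)]

# Before normalization: the value position mixes several Python types.
raw_types = sorted({type(value).__name__ for _, value in data})

# After normalization: every row is (key, list of floats).
normalized = [(key, to_float_list(value)) for key, value in data]

print(raw_types)
for row in normalized:
    print(row)
```

The same check can be run over the full list from the question before handing the rows to createDataFrame.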