
How to create a PySpark Schema for a list of tuples?

What is the correct PySpark schema for the following list of tuples? I want to apply the schema to this data:

[('a', 0.0), ('b', 6), ('c', 44), ('d', 107), ('e', 0), ('f', 3), ('g', 4), ('h', 0.025599999353289604), ('i', 0.03239999711513519), ('j', -0.03205680847167969), ('k', 0.10429033637046814), ('l', (34.190006256103516, 31.09000015258789, 31.099994659423828)), ('m', (-9.32000732421875, -9.32000732421875, -11.610000610351562)) ]

I want the result in the following format: [image: Format]

Ankan Dutta asked Dec 05 '25 14:12



1 Answer

Tanjin's answer should work, but I would like to suggest another approach. Instead of working out how many columns you need to add to your schema, create a single column of type array/list. The code below transforms your data into an RDD whose rows are [key, value] pairs, where value is a list of doubles. You can then easily apply the schema below.

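from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType, StringType, StructField, StructType

# Reuse the active session or create one
spark = SparkSession.builder.getOrCreate()

def test():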
    l = [('a', 0.0), 
    ('b', 6), 
    ('c', 44), 
    ('d', 107), 
    ('e', 0), 
    ('f', 3), 
    ('g', 4), 
    ('h', 0.025599999353289604), 
    ('i', 0.03239999711513519), 
    ('j', -0.03205680847167969), 
    ('k',0.10429033637046814), 
    ('l',(34.190006256103516, 31.09000015258789, 31.099994659423828)), 
    ('m',(-9.32000732421875, -9.32000732421875, -11.610000610351562))]

    # this schema should work for all your cases 
    schema = StructType([
        StructField("id", StringType(), False),
        StructField("num_list", ArrayType(DoubleType(), True), True)
    ])

    rdd = spark.sparkContext.parallelize(l).map(lambda r: (r[0], to_float_list(r[1])))

    df = spark.createDataFrame(rdd, schema)

    df.show(100, False)

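def to_float_list(value):
    # A tuple becomes a list of floats; a single number becomes a one-element list
    if isinstance(value, tuple):
        return [float(v) for v in value]

    return [float(value)]

# Run the example
test()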

Note that the to_float_list function accepts either a tuple or a number and converts it to a list of doubles. This will output:

+---+-----------------------------------------------------------+
|id |num_list                                                   |
+---+-----------------------------------------------------------+
|a  |[0.0]                                                      |
|b  |[6.0]                                                      |
|c  |[44.0]                                                     |
|d  |[107.0]                                                    |
|e  |[0.0]                                                      |
|f  |[3.0]                                                      |
|g  |[4.0]                                                      |
|h  |[0.025599999353289604]                                     |
|i  |[0.03239999711513519]                                      |
|j  |[-0.03205680847167969]                                     |
|k  |[0.10429033637046814]                                      |
|l  |[34.190006256103516, 31.09000015258789, 31.099994659423828]|
|m  |[-9.32000732421875, -9.32000732421875, -11.610000610351562]|
+---+-----------------------------------------------------------+
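The conversion step can also be checked on its own, without a Spark session. As far as I know, the explicit float() conversion matters because, with default schema verification, a DoubleType field does not accept a Python int such as 6. The sketch below is plain Python under that assumption; the normalize helper and the three-row sample are hypothetical and only illustrate what the map step produces.

```python
def to_float_list(value):
    """Return value as a list of floats, whether it is a number or a tuple."""
    if isinstance(value, tuple):
        return [float(v) for v in value]
    return [float(value)]


def normalize(pairs):
    # Hypothetical helper: mirrors the lambda used in the RDD map step
    return [(key, to_float_list(value)) for key, value in pairs]


# A small subset of the question's data: a float, an int, and a tuple
sample = [
    ('a', 0.0),
    ('b', 6),
    ('l', (34.190006256103516, 31.09000015258789, 31.099994659423828)),
]

rows = normalize(sample)
for key, values in rows:
    print(key, values)
```

Every row now has the same shape (a string key and a list of floats), which is what lets a single ArrayType(DoubleType()) column cover both the scalar and the tuple cases.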
abiratsis answered Dec 08 '25 08:12
