Error while using the DataFrame show() method in PySpark

I'm trying to show a PySpark DataFrame, and I'm getting the following error:

Py4JJavaError: An error occurred while calling o607.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 114.0 failed 4 times, most recent failure: Lost task 0.3 in stage 114.0 (TID 15904, zw02-data-hdp-dn25211.mt, executor 416): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 220, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 138, in dump_stream
    for obj in iterator:
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 209, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
    return lambda *a: f(*a)
  File "<ipython-input-12-2ecb67285c3b>", line 5, in <lambda>
  File "<ipython-input-12-2ecb67285c3b>", line 4, in convert_target
TypeError: int() argument must be a string, a bytes-like object or a number, not 'DenseVector'

This is my code; it runs in Jupyter:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.sql import functions as f

df2 = spark.sql(sql_text)
# Assemble the target column into a vector, then scale it to [0, 1]
assembler = VectorAssembler(inputCols=["targetstep"], outputCol="x_vec")
scaler = MinMaxScaler(inputCol="x_vec", outputCol="targetstep_scaled")
pipeline = Pipeline(stages=[assembler, scaler])
scalerModel = pipeline.fit(df2)
df2 = scalerModel.transform(df2)
# target_udf wraps convert_target (defined earlier in the notebook)
df2 = df2.withColumn('targetstep', target_udf(f.col('targetstep_scaled'))).drop('x_vec')
df2.show()

I'm sure the Pipeline and withColumn() steps are OK, but I don't know why the show() method fails.

asked Oct 14 '25 by yu song

1 Answer

PySpark DataFrames are lazily evaluated.

When you call .show(), you are asking all of the prior steps to execute, and any one of them may be broken; you just can't see the failure earlier because nothing has actually run yet.

Go back to the earlier steps and call .collect() after each operation on the DataFrame. That will at least let you isolate where the bad data was created.
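
For example, here is a minimal sketch of that debugging pass, reusing the variable names from the question:

df2 = scalerModel.transform(df2)
df2.collect()  # forces execution; if this succeeds, the pipeline itself is fine

df3 = df2.withColumn('targetstep', target_udf(f.col('targetstep_scaled')))
df3.collect()  # raises the same TypeError, so the UDF is the failing step

In this case the traceback already points at convert_target: MinMaxScaler outputs a one-element DenseVector, not a plain number, so calling int() on it fails. A hedged guess at a fix (the original convert_target isn't shown, so the lambda below is an assumption):

from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType

# Assumed fix: index into the DenseVector to get its scalar value,
# instead of calling int() on the vector itself.
target_udf = f.udf(lambda v: float(v[0]), DoubleType())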

answered Oct 17 '25 by DaveP


