How do I get data on Spark jobs and stages from Python? [duplicate]

Following the breadcrumbs, I cobbled together some code that seems to do what I want: run in the background, look at ongoing jobs, then collect... whatever information may be available:

import threading
import time

import pyspark

def do_background_monitoring(sc: pyspark.context.SparkContext):
    # daemon=True so the monitor dies with the driver instead of keeping it alive
    thread = threading.Thread(target=monitor, args=[sc], daemon=True)
    thread.start()
    return thread

def monitor(sc: pyspark.context.SparkContext):
    job_tracker: pyspark.status.StatusTracker = sc.statusTracker()  # should this go inside the loop?
    while True:
        time.sleep(1)
        for job_id in job_tracker.getActiveJobsIds():
            job: pyspark.status.SparkJobInfo = job_tracker.getJobInfo(job_id)
            stages = job.stageIds
            # ???
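To exercise it, something like this works; a minimal sketch assuming a local context, where the sample job is arbitrary and only exists to give the monitor something to watch:

sc = pyspark.SparkContext("local[2]", "monitor-demo")
do_background_monitoring(sc)
# any job will do; squaring some numbers keeps a stage busy for a moment
sc.parallelize(range(1_000_000)).map(lambda x: x * x).count()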

However, that's where I hit a dead end. According to the docs, stageIds is an int[], and apparently py4j doesn't know what to do with it? (py4j's documentation claims otherwise...)

ipdb> stages
JavaObject id=o34
ipdb> stages. 
          equals    notify    wait     
          getClass  notifyAll          
          hashCode  toString           
ipdb> stages.toString()
'[I@4b1009f3'

Is this a dead end? Are there other ways to achieve this? If I were willing to write Scala to do this, could I keep just this bit in Scala and the rest in Python?
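For what it's worth, PySpark does expose the JVM through py4j, so in principle a small Scala helper could be driven from Python. A hedged sketch: MonitorHelper is a hypothetical Scala object that would have to be compiled onto the driver classpath, and sc._jvm / sc._jsc are PySpark internals:

# hypothetical Scala object com.example.MonitorHelper on the driver classpath
helper = sc._jvm.com.example.MonitorHelper
helper.dump(sc._jsc.sc())  # hand it the underlying Scala SparkContext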

asked Oct 28 '25 by badp

1 Answer

It turns out that while the REPL made it look like Python knew nothing about the object other than that it was some sort of Object, py4j does make the contents of the array available to you (si below is the stageIds array from the question):

ipdb> type(si)
<class 'py4j.java_collections.JavaArray'>
ipdb> tuple(si)
(0,)

and now I feel really silly :)
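Putting that together, the loop from the question can be completed. A minimal sketch; the fields read from SparkStageInfo (name, numTasks, numActiveTasks, numCompletedTasks, numFailedTasks) are the ones pyspark.status documents:

def monitor(sc: pyspark.context.SparkContext):
    tracker = sc.statusTracker()
    while True:
        time.sleep(1)
        for job_id in tracker.getActiveJobsIds():
            job = tracker.getJobInfo(job_id)
            if job is None:  # the job may have finished since we listed it
                continue
            # stageIds arrives as a py4j JavaArray; iterating it yields plain ints
            for stage_id in tuple(job.stageIds):
                stage = tracker.getStageInfo(stage_id)
                if stage is None:
                    continue
                print(f"job {job_id} stage {stage_id} ({stage.name}): "
                      f"{stage.numCompletedTasks}/{stage.numTasks} tasks done, "
                      f"{stage.numActiveTasks} active, {stage.numFailedTasks} failed")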

answered Oct 31 '25 by badp

