Just wondering, how does Spark schedule jobs? In simple terms, please. I have read many descriptions of how it does it, but they were too complicated to understand.
At a high level, when any action is called on an RDD, Spark creates the DAG and submits it to the DAG scheduler.
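For instance, here is a minimal sketch (a made-up local-mode example, not from any particular application): the `map` and `filter` calls are lazy transformations, and only the `count()` action makes Spark build the DAG and submit a job.

```scala
import org.apache.spark.sql.SparkSession

object LazyActionExample {
  def main(args: Array[String]): Unit = {
    // Local-mode session purely for illustration.
    val spark = SparkSession.builder()
      .appName("scheduling-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000, 8) // 8 partitions => 8 tasks per stage

    // Transformations are lazy: nothing is scheduled yet.
    val doubled = numbers.map(_ * 2)
    val evens   = doubled.filter(_ % 4 == 0)

    // The action triggers job submission: Spark builds the DAG of these
    // operators and hands it to the DAG scheduler.
    val total = evens.count()
    println(s"count = $total")

    spark.stop()
  }
}
```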
The DAG scheduler divides the operators into stages of tasks. A stage consists of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; for example, many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages.
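As an illustration (again a sketch, assuming the same local setup): the narrow `flatMap`/`map` operators below get pipelined into one stage, while the shuffle required by `reduceByKey` starts a second stage.

```scala
import org.apache.spark.sql.SparkSession

object StageExample {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("stage-demo")
      .master("local[*]")
      .getOrCreate()
      .sparkContext

    // Narrow operators (flatMap, map) are pipelined into a single stage.
    val pairs = sc.parallelize(Seq("a b", "a c", "b c"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // reduceByKey requires a shuffle, so the DAG scheduler starts a new stage here.
    val counts = pairs.reduceByKey(_ + _)

    // The action runs both stages; the stage split is visible in the Spark UI.
    counts.collect().foreach(println)
  }
}
```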
The stages are passed on to the task scheduler. The task scheduler launches the tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about the dependencies between stages.
The executors run the tasks on the worker (slave) nodes.
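Here is a rough sketch of how these pieces come together when configuring an application; the master URLs, executor counts, and app name below are illustrative placeholders, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object ClusterManagerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-manager-demo")
      // The master URL selects the cluster manager; alternatives include
      // "spark://host:7077" (Standalone), "yarn", or "mesos://host:5050".
      .master("local[*]")
      .config("spark.executor.instances", "4") // executors live on the worker (slave) nodes
      .config("spark.executor.cores", "2")     // each executor runs tasks on its cores
      .getOrCreate()

    // Any action now flows: DAG scheduler -> task scheduler -> cluster manager -> executors.
    println(spark.sparkContext.parallelize(1 to 100).sum())

    spark.stop()
  }
}
```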
Look at this answer for more information.