When I run some of the Apache Spark examples in the Spark-Shell or as a job, I am not able to achieve full core utilization on a single machine. For example:
var textColumn = sc.textFile("/home/someuser/largefile.txt").cache()
var distinctWordCount = textColumn.flatMap(line => line.split('\0'))
                             .map(word => (word, 1))
                             .reduceByKey(_+_)
                             .count()
When running this script, I mostly see only 1 or 2 active cores on my 8 core machine. Isn't Spark supposed to parallelise this?
Spark does not require the users to have high end, expensive systems with great computing power. It splits the big data into multiple cores or systems available in the cluster and optimally utilizes these computing resources to the processes this data in a distributed manner.
It's easy to run locally on one machine — all you need is to have java installed on your system PATH , or the JAVA_HOME environment variable pointing to a Java installation. Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+.
Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data structure known as RDD (Resilient Distributed Datasets) that is a logical collection of data partitioned across machines.
You can use local[*] to run Spark locally with as many worker threads as logical cores has your machine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With