 

Apache Spark: saveAsTextFile not working correctly in Stand Alone Mode

Tags:

apache-spark

I wrote a simple Apache Spark (1.2.0) Java program to import a text file and then write it to disk using saveAsTextFile. But the output folder either has no content (just the _SUCCESS file) or at times has incomplete data (data from only some of the tasks).

When I do an rdd.count() on the RDD, it shows the correct number, so I know the RDD was constructed correctly; it is just the saveAsTextFile method that is not working.

Here is the code:

/* SimpleApp.java */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "/tmp/READ_ME.txt"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile);

    logData.saveAsTextFile("/tmp/simple-output");
    System.out.println("Lines -> " + logData.count());

    sc.stop();
  }
}
asked Oct 29 '25 by maxpayne

1 Answer

This is because you're saving to a local path. Are you running multiple machines? If so, each worker is saving its partitions to its own local /tmp directory, so no single machine ends up with the full output. Sometimes the driver also executes a task, which is why you see part of the result locally. In general, you don't want to mix distributed mode with local file systems.
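A minimal sketch of the usual fix: point saveAsTextFile at storage every node can reach (HDFS here), or, for small results, collect to the driver and write one local file there. The namenode host/port and paths below are placeholders, not from the question.

```java
/* SimpleAppShared.java -- same program, but writing to shared storage.
   The hdfs://namenode:8020 URL and all paths are assumed placeholders. */
import java.util.List;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleAppShared {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> logData =
        sc.textFile("hdfs://namenode:8020/tmp/READ_ME.txt");

    // Every executor writes its partitions to the same HDFS directory,
    // so the output is complete in one place.
    logData.saveAsTextFile("hdfs://namenode:8020/tmp/simple-output");

    // Alternative for small outputs: pull all lines back to the driver
    // and write a single file on the driver's local disk.
    List<String> lines = logData.collect();
    Files.write(Paths.get("/tmp/simple-output.txt"), lines);

    sc.stop();
  }
}
```

Either way, the key point is that the output path must mean the same thing on every machine that runs a task.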

answered Oct 31 '25 by Sean Owen

