I'm starting to play with Spark 2.0.1. The new Dataset API is very clean, but I'm having problems with very simple operations.
Maybe I'm missing something; I hope somebody can help.
These instructions
SparkConf conf = new SparkConf().setAppName("myapp").setMaster("local[*]");
SparkSession spark = SparkSession
        .builder()
        .config(conf)
        .getOrCreate();
Dataset<Info> infos = spark.read().json("data.json").as(Encoders.bean(Info.class));
System.out.println(infos.rdd().count());
produce a
java.lang.NegativeArraySizeException
and a fatal error detected by the JVM (1.8).
Working on the data using the Dataset API (e.g., selects or count on the infos object) works fine.
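For instance, a minimal sketch against the infos Dataset above (the column name is a hypothetical placeholder):
// These Dataset API operations complete without error:
System.out.println(infos.count());
infos.select("someField").show(); // "someField" is a hypothetical column

// But dropping to an RDD first triggers the exception:
System.out.println(infos.rdd().count());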
How can I switch between Dataset and RDD?
In general, this error occurs when an application tries to create an array with a negative size; see the example below.
It's a general Java error. In your case, I doubt it was caused directly by these lines themselves:
Dataset<Info> infos = spark.read().json("data.json").as(Encoders.bean(Info.class));
System.out.println(infos.rdd().count());
You can find out in which scenario an array is being initialized with a negative size by printing the complete stack trace.
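For example, you could wrap the failing call (a sketch using the infos Dataset from the question):
try {
    System.out.println(infos.rdd().count());
} catch (NegativeArraySizeException e) {
    // Print the complete stack trace to see where the negative size originates
    e.printStackTrace();
}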
❯ cat StackTest.java
import java.util.*;
import java.io.*;

public class StackTest {
    public static void main(String[] args) throws IOException {
        int[] c = new int[-2]; // line 6: throws NegativeArraySizeException
        // Never reached -- the line above throws first:
        Scanner in = new Scanner(new InputStreamReader(System.in));
        int b = in.nextInt();
        int[] a = new int[b]; // would also throw if b were negative
    }
}
❯ javac StackTest.java && java StackTest
Exception in thread "main" java.lang.NegativeArraySizeException: -2
at StackTest.main(StackTest.java:6)
Note: One common case is using Kryo serialization along with Apache Spark. The Kryo documentation explains when it can happen and how to fix it, under "Very large object graphs / Reference limits":
Kryo stores references in a map that is based on an int array. Since Java array indices are limited to Integer.MAX_VALUE, serializing large (> 1 billion) object graphs may result in a java.lang.NegativeArraySizeException. A workaround for this issue is disabling Kryo's reference tracking as indicated below:
Kryo kryo = new Kryo();
kryo.setReferences(false);
In Spark, the equivalent is the property spark.kryo.referenceTracking=false, set either in spark-defaults.conf or on the SparkConf object if you want to set it programmatically, as sketched below.
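A minimal sketch of the programmatic route (the class name is a placeholder; the property name matches the Spark docs quoted below):
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class KryoConfigExample {
    public static void main(String[] args) {
        // Disable Kryo reference tracking instead of editing spark-defaults.conf
        SparkConf conf = new SparkConf()
                .setAppName("myapp")
                .setMaster("local[*]")
                .set("spark.kryo.referenceTracking", "false");

        SparkSession spark = SparkSession.builder()
                .config(conf)
                .getOrCreate();

        // ... run the job that previously failed ...

        spark.stop();
    }
}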
The Spark docs describe spark.kryo.referenceTracking (default: true) as:
"Whether to track references to the same object when serializing data with Kryo, which is necessary if your object graphs have loops and useful for efficiency if they contain multiple copies of the same object. Can be disabled to improve performance if you know this is not the case."