
Spark 2.0.1 java.lang.NegativeArraySizeException

I'm starting to play with Spark 2.0.1. The new Dataset API is very clean, but I'm having problems with very simple operations.

Maybe I'm missing something; I hope somebody can help.

These instructions

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

SparkConf conf = new SparkConf().setAppName("myapp").setMaster("local[*]");
SparkSession spark = SparkSession
        .builder()
        .config(conf)
        .getOrCreate();

Dataset<Info> infos = spark.read().json("data.json").as(Encoders.bean(Info.class));

System.out.println(infos.rdd().count());

produce a

 java.lang.NegativeArraySizeException

and a fatal error detected by the JVM (1.8).

Working on the data using the Dataset API (e.g. select and count on the infos object) works fine.

How can I switch between Dataset and RDD?

asked Sep 01 '25 by besil

1 Answer

In general, this error occurs when an application tries to create an array with a negative size; see the example below.

It's a generic Java error, not specific to Spark. In your case I suspect it was triggered by

 Dataset<Info> infos = spark.read().json("data.json").as(Encoders.bean(Info.class));

System.out.println(infos.rdd().count());

You can find out in which scenario the array is being initialized with a negative size by printing the complete stack trace.

~/
❯ cat StackTest.java
import java.util.*;
import java.io.*;

public class StackTest {
    public static void main(String[] args) throws IOException {
        int[] c = new int[-2];  // negative array size: throws NegativeArraySizeException
        // never reached: the line above throws first; new int[b] would fail the same way for negative input
        Scanner in = new Scanner(new InputStreamReader(System.in));
        int b = in.nextInt();
        int[] a = new int[b];
    }
}

~/
❯ javac StackTest.java && java StackTest
Exception in thread "main" java.lang.NegativeArraySizeException: -2
        at StackTest.main(StackTest.java:6)
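
As a sketch (reusing the infos Dataset from the question), you can wrap the suspect call in your Spark program to capture the complete trace:

// Minimal sketch: catch the exception around the suspect call so the full
// stack trace shows where the negative array size originates.
try {
    System.out.println(infos.rdd().count());
} catch (NegativeArraySizeException e) {
    e.printStackTrace();
}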

Note: one common case is using Kryo serialization together with Apache Spark; when it can happen, and the fix, are described below.

From the Kryo documentation, under "Very large object graphs" / "Reference limits":

Kryo stores references in a map that is based on an int array. Since Java array indices are limited to Integer.MAX_VALUE, serializing large (> 1 billion) objects may result in a java.lang.NegativeArraySizeException.

A workaround for this issue is disabling Kryo's reference tracking as indicated below:

  Kryo kryo = new Kryo();
  kryo.setReferences(false);

or set the equivalent Spark property, spark.kryo.referenceTracking=false, in spark-defaults.conf, or on the SparkConf object if you want to set it programmatically.
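
For example, a minimal sketch of the programmatic route on the SparkConf from the question (assuming you are also switching the serializer to Kryo, which is what makes this property relevant):

// Sketch: enable Kryo and disable its reference tracking programmatically.
SparkConf conf = new SparkConf()
        .setAppName("myapp")
        .setMaster("local[*]")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.referenceTracking", "false");

SparkSession spark = SparkSession.builder().config(conf).getOrCreate();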

The Spark docs say:

spark.kryo.referenceTracking (default: true)

Whether to track references to the same object when serializing data with Kryo, which is necessary if your object graphs have loops and useful for efficiency if they contain multiple copies of the same object. Can be disabled to improve performance if you know this is not the case.
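
To illustrate the "loops" case with a hypothetical Node class (not from the question): with reference tracking disabled, Kryo cannot detect cycles, so only disable it if you know your object graphs are acyclic.

// Hypothetical: an object graph with a loop, which requires reference tracking.
class Node implements java.io.Serializable {
    Node next;
}

Node a = new Node();
a.next = a;  // cycle: serializing this with references disabled would recurse until it fails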

answered Sep 05 '25 by Ram Ghadiyaram