Spark Thrift Server for exposing large files?

Question

We have set up a Thrift Server with Spark 2.0 in Mesos client mode.

When we query a 170 MB Parquet file (select * from the table), the query always fails with a Java out-of-memory error (Java heap space), even though there are several executors/workers and the executors' tasks complete successfully (according to the Spark UI).

The query finally completes once the JVM memory is increased to 25 GB and the Spark driver memory to 21 GB! The bottleneck seems to be the driver memory itself.

Kryo serialization is used (spark.kryoserializer.buffer.max=1024m), the files are stored in an S3 bucket, and YARN is not used.

--Why does the driver consume that much memory for such a simple query?
--What other parameters/configuration can help support large data sets and concurrent JDBC connections?

Thanks.

Paul Lam · Accepted Answer

Q1: Parquet files are compressed; once loaded into memory the data is decompressed. What's more, Java objects, including strings, carry per-object overhead, so if you have lots of small strings the cost can be considerable.
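As a rough illustration of the overhead point, the sketch below estimates the heap footprint of one short string. The layout numbers are assumptions (64-bit HotSpot JVM with compressed object pointers, Java 8 style strings backed by a char[] array, 8-byte alignment), not measurements from the question, and the function names are made up for this example.

```python
def align8(n):
    """Round n up to the next multiple of 8 (assumed JVM object alignment)."""
    return (n + 7) // 8 * 8


def estimated_string_bytes(num_chars):
    """Estimate heap bytes for one Java String of num_chars characters.

    Assumed layout (64-bit HotSpot, compressed oops, Java 8):
      String object: 12-byte header + 4-byte array reference + 4-byte hash
      char[] array:  12-byte header + 4-byte length + 2 bytes per char
    """
    string_object = align8(12 + 4 + 4)
    char_array = align8(12 + 4 + 2 * num_chars)
    return string_object + char_array


if __name__ == "__main__":
    raw = 8  # an 8-character ASCII value is about 8 bytes of raw data
    heap = estimated_string_bytes(8)
    print(f"raw: {raw} bytes, estimated on heap: {heap} bytes")
```

Under these assumptions an 8-character value comes to 56 bytes on the heap, about seven times its raw size, and that is before counting the decompression of the Parquet data itself.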

Q2: Not sure about Spark 2.0, but in some previous versions you could use the incremental collect option to fetch results batch by batch.
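The incremental collect option corresponds, as far as I know, to the `spark.sql.thriftServer.incrementalCollect` setting. Here is a minimal sketch of how the launch arguments might be assembled, assuming the stock `sbin/start-thriftserver.sh` script. The helper name and the memory values are hypothetical placeholders to tune, not recommendations, and whether the setting behaves the same way on Spark 2.0 should be checked against that version.

```python
def thrift_server_args(driver_memory="4g", max_result_size="2g"):
    """Build start-thriftserver.sh arguments (hypothetical helper).

    The default memory values are illustrative placeholders only.
    """
    conf = {
        # Fetch result partitions one at a time instead of
        # collecting the whole result set on the driver at once.
        "spark.sql.thriftServer.incrementalCollect": "true",
        # Limit on the total serialized result size an action
        # may send back to the driver.
        "spark.driver.maxResultSize": max_result_size,
    }
    args = ["./sbin/start-thriftserver.sh", "--driver-memory", driver_memory]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(thrift_server_args()))
```

Even with incremental collect, each partition still has to fit in driver memory when it is fetched, so very large partitions may need to be split further.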

Tags: java, memory, driver, apache-spark, thrift

yin yu
