Spark Thrift Server for exposing a large file?

We have set up a Thrift Server with Spark 2.0 in Mesos client mode.

When trying to query a 170 MB Parquet file (select * from the table), it always fails with a Java out-of-memory exception (Java heap space), even though there are a couple of executors/workers and the executors' tasks complete successfully (as seen in the Spark UI).

The query finally completes successfully once the JVM memory is increased to 25 GB and the Spark driver memory to 21 GB. The bottleneck seems to be the driver memory itself.

Kryo serialization is used (spark.kryoserializer.buffer.max=1024m); the files are stored in an S3 bucket; YARN is not used.
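
For reference, the server is started roughly like this (the Mesos master address and the exact memory sizes below are illustrative placeholders, not our precise values):

    ./sbin/start-thriftserver.sh \
      --master mesos://<mesos-master>:5050 \
      --driver-memory 21g \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.kryoserializer.buffer.max=1024m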

- Why does the driver consume that much memory for such a simple query?
- What other parameters or configuration can help support large result sets and concurrent JDBC connections?

Thanks.

asked by yin yu
1 Answer

Q1: Parquet files are compressed; when loaded into memory they are decompressed. What's more, Java objects, including strings, carry their own overhead, and if you have lots of small strings the cost can be considerable.
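
As a rough, purely illustrative back-of-the-envelope estimate (the compression and overhead factors are assumptions, not measurements):

    170 MB Parquet   x  ~5x decompression          ≈ 850 MB of raw column data
    850 MB raw data  x  ~3-5x JVM object overhead  ≈ 2.5-4 GB materialized on the driver

so a plain select * that collects everything on the driver can easily need several GB of heap.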

Q2: Not sure about Spark 2.0, but in some previous versions you could use the incremental collect option to fetch results batch by batch instead of collecting them all at once.
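
If your Spark build supports it, that option is typically passed when starting the Thrift Server; the exact property name and its availability depend on the Spark version, so treat this as a sketch rather than a guaranteed setting:

    ./sbin/start-thriftserver.sh \
      --conf spark.sql.thriftServer.incrementalCollect=true \
      ...

With this enabled, the server streams results partition by partition rather than collecting the whole result set on the driver in one go.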

answered by Paul Lam

