I have some XML and plain text files that are north of 2 GB. Loading the entire file into memory every time I want to try something out in Spark takes too long on my machine.
Is there a way to read only a portion of the file (similar to running a SQL query against a large table and getting back only a few rows without it taking forever)?
You can restrict the number of rows to n while reading a file by using limit(n).
For CSV files it can be done as:
spark.read.csv("/path/to/file/").limit(n)
and for text files as:
spark.read.text("/path/to/file/").limit(n)
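Putting that together, here is a minimal runnable sketch. The file paths, the limit of 1000 rows, and the XML rowTag "record" are all placeholders, and the XML read additionally assumes the external spark-xml data source is available (the question mentions XML, which the built-in readers do not cover):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-read").getOrCreate()

# Plain text: each line becomes one row in a single "value" column
text_sample = spark.read.text("/path/to/file.txt").limit(1000)

# CSV: the header option is shown only for illustration
csv_sample = (spark.read
              .option("header", "true")
              .csv("/path/to/file.csv")
              .limit(1000))

# XML: requires the spark-xml package; "record" is a placeholder for
# whatever element wraps a single row in your file
xml_sample = (spark.read
              .format("xml")
              .option("rowTag", "record")
              .load("/path/to/file.xml")
              .limit(1000))

text_sample.show(5)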
Running explain() on the resulting DataFrames shows that the whole file is not loaded; here with n=3 on a CSV file:
== Physical Plan ==
CollectLimit 3
...
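The plan above can be reproduced with a call like this (a sketch using the same path placeholder as before):

# Print the physical plan; the CollectLimit 3 node indicates that only
# a limited number of rows is collected
spark.read.csv("/path/to/file/").limit(3).explain()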