How to read the first n rows without loading the entire file?

Tags:

apache-spark

I have some XML and regular text files that are north of 2 GB. Loading an entire file into memory every time I want to try something out in Spark takes too long on my machine.

Is there a way to read only a portion of the file (similar to running a SQL command against a large table and only getting a few rows without it taking forever)?

Asked by O.O on Oct 30 '25

1 Answer

You can restrict the result to the first n rows by calling limit(n) on the DataFrame returned by the reader.

For CSV files it can be done as:

spark.read.csv("/path/to/file/").limit(n)

and for text files as:

spark.read.text("/path/to/file/").limit(n)
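Putting the pieces together, here is a minimal runnable sketch; the app name, the path, and n = 5 are placeholders, not part of the original answer:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession.
spark = SparkSession.builder.appName("first-n-rows").getOrCreate()

# The read is lazy; limit(5) caps how many rows any action will fetch.
df = spark.read.text("/path/to/file/").limit(5)

# show() triggers the read, which short-circuits after 5 rows.
df.show(truncate=False)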

Running explain on the resulting DataFrames shows that the whole file is not loaded; here with n=3 on a CSV file:

== Physical Plan ==
CollectLimit 3
...
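For reference, the plan above can be reproduced with a call along these lines (the path is a placeholder):

spark.read.csv("/path/to/file/").limit(3).explain()

The CollectLimit node at the top of the plan indicates that Spark stops pulling rows once it has collected 3 of them, which is why reading the first few rows of a multi-gigabyte file returns quickly.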
Answered by Shaido on Nov 02 '25