
How should I express an HDFS path in Spark's textFile?

I want to load data from paths like these:

hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-04/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-05/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-06/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-07/*/*
...
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-14/*/*

This is my code:

val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"1[0-3]".r+"/*/*")

and

val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"0[4-9]".r+"/*/*")

Either of these works, but

val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"0[0-9]|1[0-4]".r+"/*/*")

doesn't work.

How should I write the path pattern to load all the data from 04 to 13?

安安朱 asked Dec 21 '25 21:12


2 Answers

Hadoop path patterns are globs, not regular expressions, so | is not treated as alternation. Use the following syntax instead:

  • {a,b} instead of (a|b)

In your case, loading the text files would look like this:

val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-{0[4-9],1[0-3]}/*/*")

This loads all files from the 2019-02-04 through 2019-02-13 subdirectories.
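
To make the difference concrete, here is a minimal sketch in plain Scala (no Spark needed). It shows that concatenating a Regex into the path only appends its pattern text, and it builds the glob form next to it. The object name GlobPaths and the helper explicitPaths are hypothetical, introduced only for illustration.

```scala
import java.time.LocalDate

object GlobPaths {
  // Base directory taken from the question.
  val base = "hdfs://dcoshdfs/encrypt_data/gmap_info"

  // Concatenating a Regex into a String just calls toString on it, which
  // yields the raw pattern text; nothing is evaluated as a regular expression.
  val fromRegex: String = base + "/2019-02-" + "0[0-9]|1[0-4]".r + "/*/*"

  // Glob form of the intended range: braces with commas express alternation.
  val glob: String = base + "/2019-02-{0[4-9],1[0-3]}/*/*"

  // Hypothetical helper: one glob per day in [start, end], joined with commas.
  def explicitPaths(start: LocalDate, end: LocalDate): String =
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(!_.isAfter(end))
      .map(d => s"$base/$d/*/*")
      .mkString(",")
}
```

Since sc.textFile also accepts a comma-separated list of paths, something like sc.textFile(GlobPaths.explicitPaths(LocalDate.of(2019, 2, 4), LocalDate.of(2019, 2, 13))) should be an alternative when the range crosses a month boundary and a single glob gets awkward.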

pheeleeppoo answered Dec 23 '25 14:12


This is not exactly an answer, more a best-practice suggestion: if you can control the path layout, save your data under date partitions:

hdfs://dcoshdfs/encrypt_data/gmap_info/date=20190519
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20190418
.
.
.
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20160101

Then you can simply extract whatever you want using Spark:

import spark.implicits._  // assumes a SparkSession named spark; enables the 'date column syntax
val data = spark.read.text("hdfs://dcoshdfs/encrypt_data/gmap_info").where('date >= 20190204L && 'date <= 20190213L)

This is an efficient approach, since Spark discovers the date partitions from the directory names and prunes them, loading only the data it needs. Plus, it is much more readable.
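
As a small illustration of the date= layout, the sketch below builds the directory name for a given day in plain Scala. The object DatePartitions and the helper partitionDir are hypothetical names, and the yyyyMMdd format is assumed from the example paths.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DatePartitions {
  // Base directory taken from the example paths.
  val base = "hdfs://dcoshdfs/encrypt_data/gmap_info"

  // Hypothetical helper: directory for one day in key=value form.
  // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd.
  def partitionDir(day: LocalDate): String =
    s"$base/date=${day.format(DateTimeFormatter.BASIC_ISO_DATE)}"
}
```

Writing each day's files under DatePartitions.partitionDir(day) keeps the key=value naming that Spark relies on to expose date as a column.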

RefiPeretz answered Dec 23 '25 12:12