I want to load data from paths like these:
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-04/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-05/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-06/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-07/*/*
...
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-14/*/*
This is my code:
val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"1[0-3]".r+"/*/*")
and
val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"0[4-9]".r+"/*/*")
Either of these works, but
val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"0[0-9]|1[0-4]".r+"/*/*")
doesn't work.
How should I write the path pattern to load all the data from 04 to 13?
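The path is interpreted as a Hadoop glob pattern, not a regular expression, so | is not treated as alternation. Use the glob syntax for alternation instead:
{a,b} instead of (a|b)
So in your case, loading the text files would look like this: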
val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-{0[4-9],1[0-3]}/*/*")
This loads all files from the 2019-02-04 through 2019-02-13 subdirectories.
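If the date range does not fit neatly into a character-class glob (for example, a range that crosses a month boundary), another option is to generate one path per day and pass them as a comma-separated list, which sc.textFile accepts. The following is only a sketch: dailyGlobs is a hypothetical helper name, and the base path and the /*/* suffix are taken from the question.

```scala
import java.time.LocalDate

// Hypothetical helper: one glob per day from `start` to `end`, both inclusive.
def dailyGlobs(base: String, start: LocalDate, end: LocalDate): Seq[String] =
  Iterator.iterate(start)(_.plusDays(1))
    .takeWhile(day => !day.isAfter(end))
    .map(day => s"$base/$day/*/*") // LocalDate.toString yields yyyy-MM-dd
    .toList

val paths = dailyGlobs(
  "hdfs://dcoshdfs/encrypt_data/gmap_info",
  LocalDate.of(2019, 2, 4),
  LocalDate.of(2019, 2, 13))

// Assumed usage with an existing SparkContext named `sc`:
// val data = sc.textFile(paths.mkString(","))
```

This trades a compact pattern for an explicit list, which may be easier to review when the range is computed at runtime.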
This is not exactly an answer, more of a best-practice suggestion: if you are able to control the path layout, try to save your data under date partitions:
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20190519
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20190418
.
.
.
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20160101
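Then you can simply extract whatever you want using Spark:
// Assumes `spark` is a SparkSession with spark.implicits._ in scope (as in spark-shell);
// an RDD from sc.textFile has no `where`, so the DataFrame reader is used here.
val data = spark.read.text("hdfs://dcoshdfs/encrypt_data/gmap_info").where('date >= 20190204L && 'date <= 20190213L)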
This is an efficient approach, since Spark can prune partitions and load only the directories whose date value matches the filter; plus, it is much more readable.
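When the range comes from real dates rather than hard-coded numbers, the filter bounds can be derived from java.time values. This is a minimal sketch assuming the date=yyyyMMdd directory naming shown above; partitionKey is a hypothetical helper name, and the commented usage assumes a SparkSession named spark.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper: map a LocalDate to the yyyyMMdd number used in date=... directories.
def partitionKey(day: LocalDate): Long =
  day.format(DateTimeFormatter.BASIC_ISO_DATE).toLong

val lower = partitionKey(LocalDate.of(2019, 2, 4))
val upper = partitionKey(LocalDate.of(2019, 2, 13))

// Assumed usage, with spark.implicits._ imported for the $"..." column syntax:
// val data = spark.read.text("hdfs://dcoshdfs/encrypt_data/gmap_info")
//   .where($"date" >= lower && $"date" <= upper)
```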