
Processing data stored in Redshift

We're currently using Redshift as our data warehouse, and we're very happy with it. However, we now have a requirement to do machine learning against the data in the warehouse. Given the volume of data involved, I'd ideally like to run the computation in the same location as the data rather than shipping the data around, but this doesn't seem possible with Redshift. I've looked at MADlib, but it's not an option because Redshift doesn't support the UDFs that MADlib requires. I'm currently looking at shifting the data over to EMR and processing it with the Apache Spark machine learning library (or maybe H2O, or Mahout, or whatever). So my questions are:

  1. Is there a better way?
  2. If not, how should I make the data accessible to Spark? The options I've identified so far are: use Sqoop to load it into HDFS, use DBInputFormat, or do a Redshift UNLOAD to S3 and have Spark read it from there. What are the pros and cons of these approaches (and any others) when using Spark?

Note that this is offline batch learning, but we'd like it to run as quickly as possible so that we can iterate on experiments rapidly.
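For reference, the UNLOAD-to-S3 route from the list above can be sketched as follows. This is a minimal sketch: the table name, S3 bucket, and IAM role ARN are hypothetical placeholders, and you'd issue the resulting statement against Redshift via any JDBC/ODBC or psycopg2 connection.

```python
def build_unload_sql(query, s3_path, iam_role):
    """Build a Redshift UNLOAD statement that exports a query's results
    to S3 as gzipped, pipe-delimited files, written in parallel (one or
    more files per node slice) -- much faster than pulling rows over a
    single JDBC connection."""
    # Single quotes inside the query must be escaped for UNLOAD;
    # ESCAPE protects delimiters embedded in the data, GZIP shrinks
    # the transfer that Spark later reads back from S3.
    return (
        "UNLOAD ('{query}') "
        "TO '{s3_path}' "
        "IAM_ROLE '{iam_role}' "
        "ESCAPE GZIP"
    ).format(
        query=query.replace("'", "\\'"),
        s3_path=s3_path,
        iam_role=iam_role,
    )

# Hypothetical table, bucket, and role for illustration only.
sql = build_unload_sql(
    "SELECT * FROM events",
    "s3://my-bucket/exports/events_",
    "arn:aws:iam::123456789012:role/RedshiftUnload",
)
```

Spark can then read the resulting files straight from S3 (e.g. `sc.textFile("s3n://my-bucket/exports/events_*")`), which avoids the extra hop through HDFS that the Sqoop route requires.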

asked Jan 17 '26 by deanj

1 Answer

If you'd like to query Redshift data in Spark and you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel. If you plan to run many different ML jobs on your Redshift data, then consider using spark-redshift to export it out of Redshift and save it to S3 in an efficient file format, such as Parquet.
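The export-then-save workflow described above can be sketched in PySpark roughly as follows. This is a sketch under stated assumptions: it assumes Spark 1.4+ with the spark-redshift package on the classpath, and the JDBC URL, credentials, table name, and S3 paths are all hypothetical placeholders.

```python
# Hypothetical connection settings -- substitute your own cluster,
# credentials, and S3 bucket.
jdbc_url = (
    "jdbc:redshift://example-cluster.abc123.us-east-1"
    ".redshift.amazonaws.com:5439/warehouse"
    "?user=analyst&password=secret"
)
temp_dir = "s3n://my-bucket/spark-redshift-temp/"

def export_to_parquet(sqlContext, table, parquet_path):
    """Load a Redshift table into a Spark SQL DataFrame via
    spark-redshift (which UNLOADs to S3 in parallel under the hood),
    then persist it as Parquet so later ML jobs can skip Redshift
    entirely."""
    df = (sqlContext.read
          .format("com.databricks.spark.redshift")
          .option("url", jdbc_url)      # JDBC endpoint of the cluster
          .option("dbtable", table)     # or .option("query", ...) for a subset
          .option("tempdir", temp_dir)  # S3 staging area for the UNLOAD
          .load())
    df.write.parquet(parquet_path)
    return df
```

After a one-time export like this, each ML experiment can start from `sqlContext.read.parquet(...)`, which keeps iteration fast without repeatedly hitting the warehouse.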

Disclosure: I'm one of the authors of spark-redshift.

answered Jan 20 '26 by Josh Rosen

