How to read Azure Table Storage data from Apache Spark running on HDInsight

Is there any way to do that from a Spark application running on Azure HDInsight? We are using Scala.

Azure Blobs are supported (through WASB). I don't understand why Azure Tables aren't.

Thanks in advance

asked by Jose Parra

2 Answers

You can actually read from Table Storage in Spark. Here is a project by a Microsoft engineer that does just that:

https://github.com/mooso/azure-tables-hadoop

You probably won't need all the Hive stuff, just the classes at the root level (one way to wire them into a build is sketched after this list):

  • AzureTableConfiguration.java
  • AzureTableInputFormat.java
  • AzureTableInputSplit.java
  • AzureTablePartitioner.java
  • AzureTableRecordReader.java
  • BaseAzureTablePartitioner.java
  • DefaultTablePartitioner.java
  • PartitionInputSplit.java
  • WritableEntity.java
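
If you go that route, one option (a sketch, not the project's official packaging) is to copy those .java files into src/main/java of an sbt project and add the Azure Storage SDK they depend on. The artifact versions below are assumptions; match them to the repo's pom.xml:

// build.sbt fragment -- versions are illustrative assumptions
libraryDependencies ++= Seq(
  // Table Storage client used by the classes above
  "com.microsoft.azure" % "azure-storage" % "2.0.0",
  // Hadoop InputFormat APIs; already provided on the HDInsight cluster
  "org.apache.hadoop" % "hadoop-client" % "2.7.3" % "provided"
)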

You can read with something like this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
// The two classes below come from the azure-tables-hadoop project linked
// above; adjust the import to the package you compiled them into.
import com.microsoft.hadoop.azure.{AzureTableInputFormat, WritableEntity}

// Yields an RDD[(Text, WritableEntity)], one pair per table entity.
val rdd = sparkContext.newAPIHadoopRDD(
  getTableConfig(tableName, account, key),
  classOf[AzureTableInputFormat],
  classOf[Text],
  classOf[WritableEntity])

// Builds the Hadoop configuration that AzureTableInputFormat reads its
// table name, account URI, and storage key from.
def getTableConfig(tableName: String, account: String, key: String): Configuration = {
  val configuration = new Configuration()
  configuration.set("azure.table.name", tableName)
  configuration.set("azure.table.account.uri", account)
  configuration.set("azure.table.storage.key", key)
  configuration
}

You will then have to write a decoding function to transform each WritableEntity into the class you want.
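
For example, here is a minimal sketch of such a decoder. It assumes WritableEntity exposes its columns through a getProperties map of EntityProperty values, the way the Azure Storage SDK's DynamicTableEntity does; check the class in the repo for the actual accessors. The Person type and its property names are purely hypothetical:

// Hypothetical target type; replace with whatever your rows contain.
case class Person(name: String, age: Int)

// Assumes getProperties returns a java.util.Map[String, EntityProperty]
// and that EntityProperty offers getValueAsString / getValueAsInteger,
// as in the Azure Storage SDK.
def toPerson(entity: WritableEntity): Person = {
  val props = entity.getProperties
  Person(props.get("Name").getValueAsString,
         props.get("Age").getValueAsInteger)
}

// Drop the Text key and decode each entity.
val people = rdd.map { case (_, entity) => toPerson(entity) }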

It worked for me!

answered by Lucian


Currently, Azure Tables are not supported directly. Only Azure Blobs expose the HDFS interface that Hadoop and Spark require.
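
For contrast, reading from Blob storage through WASB is just a matter of using the wasb:// URI scheme; the container, account, and path below are placeholders:

// Hypothetical container/account/path -- WASB exposes blobs as HDFS paths.
val lines = sparkContext.textFile(
  "wasb://mycontainer@myaccount.blob.core.windows.net/data/input.txt")
lines.take(10).foreach(println)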

answered by Asad Khan