Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using JSON Path in Spark SQL

While working with nested JSON in Spark SQL, can JSON path be used to extract data from JSON?

Ex:

{
    "store": {
        "book": [
            {
                "category": "reference",
                "author": "Nigel Rees",
                "title": "Sayings of the Century",
                "price": 8.95
            },
            {
                "category": "fiction",
                "author": "Evelyn Waugh",
                "title": "Sword of Honour",
                "price": 12.99
            },
            {
                "category": "fiction",
                "author": "Herman Melville",
                "title": "Moby Dick",
                "isbn": "0-553-21311-3",
                "price": 8.99
            },
            {
                "category": "fiction",
                "author": "J. R. R. Tolkien",
                "title": "The Lord of the Rings",
                "isbn": "0-395-19395-8",
                "price": 22.99
            }
        ],
        "bicycle": {
            "color": "red",
            "price": 19.95
        }
    },
    "expensive": 10
}

To select authors where book category is reference using json path, I can use $.store.book[?(@.category == 'reference')].author.

Is it possible to define a Spark table using such operations?

like image 852
Abhay Dubey Avatar asked Dec 07 '25 17:12

Abhay Dubey


1 Answers

I am not aware of a way of using JSONPath to do what you want, but here is how I would do it:

  1. explode the array,
  2. query on the records.

Note that you do not have to explode the entire array.

Code:

package net.jgp.books.sparkInAction.ch12.lab920QueryOnJson;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

/**
 * Using JSONpath-like in SQL queries.
 * 
 * @author jgp
 */
public class QueryOnJsonApp {

  /**
   * main() is your entry point to the application.
   * 
   * @param args
   */
  public static void main(String[] args) {
    QueryOnJsonApp app = new QueryOnJsonApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Query on a JSON doc")
        .master("local")
        .getOrCreate();

    // Reads a JSON, stores it in a dataframe
    Dataset<Row> df = spark.read()
        .format("json")
        .option("multiline", true)
        .load("data/json/store.json");

    // Explode the array
    df = df
        .withColumn("items", functions.explode(df.col("store.book")));

    // Creates a view so I can use SQL
    df.createOrReplaceTempView("books");
    Dataset<Row> authorsOfReferenceBookDf =
        spark.sql("SELECT items.author FROM books WHERE items.category = 'reference'");
    authorsOfReferenceBookDf.show(false);
  }
}

I think it's a great question, I added it to chapter 12 of Spark in Action.

like image 152
jgp Avatar answered Dec 11 '25 16:12

jgp