
How to extract tables with data from .sql dumps using Spark?

I have around four self-contained *.sql dumps (about 20 GB each) that I need to convert to datasets in Apache Spark.

I have tried setting up a local MySQL (InnoDB) database and importing a dump, but that is far too slow (I spent around 10 hours on it).
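For reference, reading a table back out of that local database into Spark would look roughly like this (just a sketch; the host, database name, table name and credentials are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbcImport").getOrCreate()

// Read one table from the local MySQL instance the dump was imported into
// (placeholder connection details).
val someTableDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/some_db")
  .option("dbtable", "some_table")
  .option("user", "root")
  .option("password", "secret")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .load()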

So instead I read the file directly into Spark:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

val sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
import sparkSession.implicits._

val myQueryFile = sparkSession.sparkContext.textFile("C:/Users/some_db.sql")

// Convert this to an indexed DataFrame so you can parse multi-line CREATE / INSERT statements.
// This will also show you the structure of the sql dump for your use case.
val myQueryFileDF = myQueryFile.toDF()
  .withColumn("index", monotonically_increasing_id())
  .withColumnRenamed("value", "text")

// Identify all tables and data in the sql dump along with their indexes
val tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
val tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))

println("If there is a count mismatch between these values, choose a different substring: " +
  tableStructures.count() + " " + tableStructureEnds.count())

val tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))

The problem is that the dump contains multiple tables, each of which needs to become its own dataset. To start with, I need to understand whether this can be done for even a single table. Is there any .sql parser written for Scala/Spark?

Is there a faster way of going about this? Can I read a self-contained .sql file directly into Hive?

UPDATE 1: I am writing the parser for this based on the input given by Ajay.

UPDATE 2: Changing everything to dataset-based code so I can use the SQL parser, as suggested.
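For illustration, a rough sketch of the dataset-based extraction I am experimenting with for a single table (the table name some_table and the naive value splitting are assumptions; multi-line or multi-row INSERT statements would still need the proper SQL parser):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dumpParser").getOrCreate()
import spark.implicits._

// Read the dump as a Dataset[String], one element per line.
val lines = spark.read.textFile("C:/Users/some_db.sql")

// Keep only the INSERT statements for one (hypothetical) table.
val someTableInserts = lines.filter(_.startsWith("INSERT INTO `some_table`"))

// Very naive value extraction: take everything between "VALUES (" and ");"
// and split on commas -- only works for simple single-line, single-row INSERTs.
val someTableRows = someTableInserts.map { line =>
  val start = line.indexOf("VALUES (") + "VALUES (".length
  val end = line.lastIndexOf(");")
  line.substring(start, end).split(",").map(_.trim.stripPrefix("'").stripSuffix("'"))
}

someTableRows.show(5, truncate = false)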

Asked by Leothorn

1 Answer

Is there any .sql parser written for scala spark ?

Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?

The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?

Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.

The ParserInterface is available as the sqlParser of a SessionState:

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
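For example, you could hand it a single statement pulled out of the dump (the CREATE TABLE text below is a made-up stand-in, not something from your dump):

// Parse one SQL statement into a Catalyst logical plan.
val createStmt = "CREATE TABLE some_table (id INT, name STRING)"
val logicalPlan = spark.sessionState.sqlParser.parsePlan(createStmt)
// logicalPlan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan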

Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr or simply the expr standard function. You may also use the SQL parser directly.
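For instance, a quick sketch of those entry points (the statements are placeholders, not taken from your dump):

import org.apache.spark.sql.functions.expr

// SparkSession.sql runs a full statement through the parser.
spark.sql("CREATE TABLE some_table (id INT, name STRING) USING parquet")
spark.sql("INSERT INTO some_table VALUES (1, 'a')")

// selectExpr and expr parse SQL expressions.
spark.table("some_table").selectExpr("id + 1 AS next_id").show()
spark.table("some_table").select(expr("upper(name)")).show()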


Shameless plug: you may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.

Answered by Jacek Laskowski

