
Spark SQL Insert Select with a column list?

Tags:

apache-spark

As I read the Spark/Hive SQL documentation, it appears that INSERT INTO a table with a column list is not supported in Spark 2.4 and earlier versions.

I have a source table and a destination table with a different number of columns and different column names, and I need to copy data between them.

Does this mean I have to code this in PySpark, since Spark SQL will not be able to do it?

Example:

input_table( cola, colb, colc, cold, cole)

output_table(fieldx, fieldy, fieldz)

In SQL (assuming an RDBMS such as MS-SQL, PostgreSQL, etc.) I would do the following:

insert into output_table(fieldx, fieldy, fieldz) select cola, colb, colc from input_table

Spark SQL does not allow this; it does not accept a column list in an INSERT statement.

Question: how can I do this task with a minimum of code and maximum performance, in either PySpark or (ideally) Spark SQL? (I am using Spark 2.4.)

thank you

1 Answer

Specify the columns of output_table that won't be copied from input_table as null in the select. (This is exactly what would happen to the omitted columns if an explicit column list were allowed.)

insert into output_table
select cola, colb, colc,
       null as other1  -- specify non-copied column values as null
from input_table
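
Note that Spark's INSERT INTO ... SELECT is purely positional, so the select list must produce the destination columns in the same order as they are defined in output_table. If you would rather do this from PySpark, a minimal DataFrame sketch of the same idea (assuming both tables are registered in the catalog and output_table is defined as fieldx, fieldy, fieldz) looks like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Select the source columns in the order of the destination schema,
# renaming them to the destination column names. If the destination
# had extra columns, pad them with F.lit(None).alias(...) here.
df = (spark.table("input_table")
           .select(F.col("cola").alias("fieldx"),
                   F.col("colb").alias("fieldy"),
                   F.col("colc").alias("fieldz")))

# insertInto matches columns by position, so the select order above
# must line up with output_table's column order.
df.write.insertInto("output_table", overwrite=False)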