
Load to BigQuery via Spark job fails with exception: Multiple sources found for parquet

I have a Spark job that loads data into BigQuery. The job runs on a Dataproc cluster. This is the snippet:

df.write
      .format("bigquery")
      .mode(writeMode)
      .option("table",tabName)
      .save()

I have specified the Spark BigQuery connector jar (spark-bigquery-with-dependencies_2.12-0.19.1.jar) in the --jars argument of the spark-submit command.

When I run the code, I get the following exception: java.lang.RuntimeException: Failed to write to BigQuery

Detailed error

Caused by: org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2, org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat), please specify the fully qualified class name.
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:717)

These are the dependencies in my project:

<dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.14</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.8</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-bigquery</artifactId>
            <version>1.133.1</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud.spark</groupId>
            <artifactId>spark-bigquery_2.12</artifactId>
            <version>0.21.1</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-storage</artifactId>
            <version>1.116.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.3.3</version>
        </dependency>
    </dependencies>

I am building an uber jar to run the Spark job. If I remove the --jars param, the job fails while reading a BigQuery table:

java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
Ayan Biswas asked Oct 28 '25 04:10



1 Answer

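It seems you are running on Spark 3.x with an uber jar that was compiled against Spark 2.4.8 and bundles its artifacts. With both versions on the classpath, Spark likely finds two providers for parquet, which matches the two classes named in the error. The solution is simple: mark scala-library and spark-sql with the provided scope so they are not packaged into the uber jar. Also, since you bring in the spark-bigquery-connector externally through --jars, you don't need to declare it as a dependency in the project, and the same goes for the google-cloud-* dependencies unless you use them directly.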
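
As a rough sketch of that change, the dependencies section might be reduced to something like the following. This assumes the connector keeps coming from --jars and that the google-cloud-* libraries are not called directly. The version numbers are simply the ones from the question. The cluster's actual Spark version is not stated in the thread, so aligning spark-sql with it is left as an open assumption.

```xml
<dependencies>
    <!-- Supplied by the cluster's Spark runtime, so excluded from the uber jar -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.12.14</version>
        <scope>provided</scope>
    </dependency>
    <!-- Needed only at compile time; the cluster supplies its own Spark SQL classes.
         Version kept from the question; matching it to the cluster is an assumption. -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>2.4.8</version>
        <scope>provided</scope>
    </dependency>
    <!-- spark-bigquery and google-cloud-* entries are omitted here on the assumption
         that the connector jar is passed through the jars argument of spark-submit. -->
</dependencies>
```

With the provided scope, Maven still compiles against these artifacts but the shade plugin should leave them out of the packaged jar, so only the cluster's own Spark classes are loaded at runtime.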

David Rabinowitz answered Oct 29 '25 20:10
