
How to add partitioning to an existing Iceberg table

How do I add partitioning to an existing Iceberg table that is not partitioned? The table is already loaded with data.

The table was created like this:

import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.catalog._
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.PartitionSpec
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.functions.lit  // needed for lit("something") below

val df1 = spark
  .range(1000)
  .toDF
  .withColumn("level", lit("something"))

val catalog = new HiveCatalog(spark.sessionState.newHadoopConf())

val icebergSchema = SparkSchemaUtil.convert(df1.schema)

val icebergTableName = TableIdentifier.of("default", "icebergTab")

val icebergTable = catalog
  .createTable(icebergTableName, icebergSchema, PartitionSpec.unpartitioned)

Any suggestions?

asked by domisj

1 Answer

Right now, the way to add partitioning is to update the partition spec manually.

import org.apache.iceberg.BaseTable

// Load the table and drop down to its underlying TableOperations
val table = catalog.loadTable(icebergTableName)
val ops = table.asInstanceOf[BaseTable].operations

// Build the new spec against the table's current schema
val spec = PartitionSpec.builderFor(table.schema).identity("level").build

// Commit new metadata that carries the updated partition spec
val base = ops.current
val newMeta = base.updatePartitionSpec(spec)
ops.commit(base, newMeta)
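
To sanity-check the commit, you can reload the table and print its spec (a small sketch reusing the catalog and identifier from the question):

// Reload and confirm the new spec is active; the output should list
// the identity partition on "level".
val refreshed = catalog.loadTable(icebergTableName)
println(refreshed.spec)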

There is a pull request to add an operation to make changes, like addField("level"), but that isn't quite finished yet. I think it will be in the 0.11.0 release.
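For reference, once that lands, the call might look something like this (a sketch of the proposed API, so the final method names could differ):

// Proposed fluent update: add an identity partition field on "level"
// and commit the new spec in one step.
table.updateSpec()
  .addField("level")
  .commit()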

Keep in mind:

  • After you change the partition spec, the existing data files will show null values for the new partition fields in metadata tables. That doesn't mean those values would have been null had the data been written with the new spec; it just means the metadata doesn't record them for existing data files.
  • Dynamic partition replacement behaves differently under the new spec because the granularity of a partition changes. Without a spec, INSERT OVERWRITE replaces the whole table; with a spec, only the partitions with new rows are replaced. To avoid surprises, we recommend the DataFrameWriterV2 interface in Spark, where you can be explicit about which data values are overwritten (see the sketch after this list).
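
For example, here is a minimal sketch with DataFrameWriterV2, assuming Spark 3 and that default.icebergTab resolves through your configured catalog:

import org.apache.spark.sql.functions.{col, lit}

// Overwrite only the rows where level = 'something'; everything else
// is left untouched, regardless of how the table is partitioned.
df1.writeTo("default.icebergTab")
  .overwrite(col("level") === lit("something"))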
answered by Ryan Blue