When converting a timestamp between timezones in databricks/spark sql, the timezone itself seems lost in the end result, and I can't seem to either keep it or add it back.
I have a bunch of UTC times and am using the from_utc_timetamp() to convert them to a different timezone based on another field. The result is calculated correctly, but if I output it with a timezone it shows as UTC. It seems the conversion is done correctly but the end result has no timezone stored with it (affirmed by this answer), so it uses the server zone for the timezone in all cases.
Example: Using the following SQL:
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s Z") createTimestampLocal,
I get the following:

You can see that the third column has done the conversions correctly for the timezones, but the output itself still shows as being in UTC timezone.
Repeating this with a lowercase z in the date_format function shows the same; namely, the conversions occur but the end result is still treated as UTC.
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s z") createTimestampLocal,

I can also use an O in the format output instead of a Z or z, but this just gives me GMT instead of UTC; same output basically.
All the databricks documentation or stackoverflow questions I can find seem to treat printing timezones as a matter of setting the spark server time and outputting that way, or doing the conversion without keeping the resulting timezone. I'm trying to convert to multiple different timezones though, and to keep the timezone in the output. I need to generate the end result in this format:

Is there a way to do this? How do I either keep the timezone after the conversion or add it back in the format I need based on the timezone column I have? Given that the conversion works, and that I can output the end result with a +0000 on it, all the functionality to do this seems there, how do I put it together?
Spark does not support TIMESTAMP WITH TIMEZONE datatype as defined by ANSI SQL. Even though there are some functions that convert the timestamp across timezones, this information is never stored. Databricks documentation on timestamps explains:
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE, which is a combination of the fields (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, SESSION TZ) where the YEAR through SECOND field identify a time instant in the UTC time zone, and where SESSION TZ is taken from the SQL config spark.sql.session.timeZone.
In your case spark.sql.session.timeZone is UTC and Z symbol in datetime pattern will always return UTC. Therefore you will never get a correct behavior with date_format if you deal with multiple timezones in a single query.
The only thing you can do is to explicitly store timezone information in a column and manually append it for display.
concat(
date_format(from_utc_timestamp(createTimestampUTC, v.timezone), "yyyy-MM-dd'T'HH:mm:s "),
v.timezone
) createTimestampLocal
This will display 2022-03-01T16:47:22.000 America/New_York. If you need an offset (-05:00) you will need to write a UDF to do the conversion and use Python or Scala native libraries that handle datetime conversions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With