How can I get the S3 location of a Databricks DBFS path

I know my DBFS path is backed by S3. Is there any utility/function to get the exact S3 path from a DBFS path? For example,

%python
required_util('dbfs:/user/hive/warehouse/default.db/students')
>> s3://data-lake-bucket-xyz/.......

I went through a few other discussions, for example "What s3 bucket does DBFS use?" and "How can I get the S3 location of a DBFS path", but didn't find a useful answer.

asked Sep 12 '25 by soumya-kole

2 Answers

Alex's answer is right regarding the default bucket for a workspace. The following may help with more specific versions of this question; the section on mounts in DBFS is what I was hoping to find here.

*This answer uses AWS s3: paths, but it applies equally to Azure abfss: or GCP gs: bucket paths.

Database or table location

The question suggests you want to know the location of a database (dbfs:/user/hive/warehouse/default.db) or of a table within that database (students), as opposed to just the default bucket.

If the database was created with an explicit location (e.g. CREATE DATABASE ... LOCATION 's3://...'), then the following may shed light:

%sql
describe database extended <dbname_here>

Unfortunately that won't help for default, because it will just report the location dbfs:/user/hive/warehouse. (Note this is all pre-Unity-Catalog; we shouldn't be using default anymore.)

It might be that you created students as an external table; if so, running the describe command below may show you an s3: path. If instead the table is managed in default, it will have the location dbfs:/user/hive/warehouse/<tablename_here>.
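For example (describe table extended is the table-level analogue of the database command above; students is the table name from the question):

%sql
describe table extended students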

Mounts

If you are looking for the source location backing a mount, e.g. something under dbfs:/mnt, then there is a way to do that using dbutils.fs.mounts()

For example, this helper function will tell you the source of a DBFS mount:

from typing import Optional

def get_dbfs_mnt_source(mount: str) -> Optional[str]:
  """Gets the string path source for a DBFS mount point."""
  # each entry returned by dbutils.fs.mounts() has a mountPoint
  # and the source URI it is backed by
  mnts = dbutils.fs.mounts()
  source = None
  for mnt in mnts:
    if mnt.mountPoint == mount:
      source = mnt.source
      break
  return source

# usage
s = get_dbfs_mnt_source('/mnt/s3-load-unload')
print(s)
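For an S3-backed mount, the printed source will be the underlying bucket URI, typically of the form s3a://<bucket>/<path> (the mount point /mnt/s3-load-unload above is just an example name).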
answered Sep 14 '25 by Davos


This information isn't really available inside the execution context of the cluster. The closest I can think of is to use the Account REST API, but you need to be an account admin for that:

  • You can get information about a specific workspace using the Get Workspace API
  • From the result of this GET request you can read the storage configuration ID (the storage_configuration_id field), and then use the Get Storage Configuration API to retrieve the bucket information (a sketch of both calls follows below)
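For reference, here's a minimal sketch of those two calls in python using the requests library. The host, account ID, workspace ID, and credentials are placeholders you'd fill in yourself, and the response field names (storage_configuration_id, root_bucket_info.bucket_name) are my reading of the AWS Account API docs:

%python
import requests

# placeholders - fill in for your own account; host is the AWS account console
ACCOUNT_HOST = "https://accounts.cloud.databricks.com"
ACCOUNT_ID = "<account-id>"
WORKSPACE_ID = "<workspace-id>"
AUTH = ("<account-admin-user>", "<password>")

def get_workspace_root_bucket() -> str:
  """Returns the name of the S3 root bucket backing a workspace's DBFS."""
  # step 1: Get Workspace - read the storage_configuration_id field
  ws = requests.get(
    f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}/workspaces/{WORKSPACE_ID}",
    auth=AUTH,
  ).json()
  storage_config_id = ws["storage_configuration_id"]

  # step 2: Get Storage Configuration - read the root bucket name
  cfg = requests.get(
    f"{ACCOUNT_HOST}/api/2.0/accounts/{ACCOUNT_ID}"
    f"/storage-configurations/{storage_config_id}",
    auth=AUTH,
  ).json()
  return cfg["root_bucket_info"]["bucket_name"]

print(get_workspace_root_bucket())  # e.g. data-lake-bucket-xyz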
answered Sep 14 '25 by Alex Ott