I know my DBFS path is backed by S3. Is there any utility/function to get the exact S3 path from a DBFS path? For example,
%python
required_util('dbfs:/user/hive/warehouse/default.db/students')
>> s3://data-lake-bucket-xyz/.......
I went through a few other discussions, for example "What S3 bucket does DBFS use?" and "How can I get the S3 location of a DBFS path?", but didn't find a useful answer.
Alex is right regarding the default bucket for a workspace. The following may help for more specific versions of this question - the part on DBFS mounts is what I was hoping to find here.
*This answer mentions AWS s3:// paths, but it applies equally to Azure abfss:// or GCP gs:// bucket paths.
The question suggests you want to know the location of a database (dbfs:/user/hive/warehouse/default.db) or a table within that database (students), as opposed to just the default bucket.
If the database has been created as a managed database (e.g. with a LOCATION clause), then the following may shed light:
%sql
describe database extended <dbname_here>
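If you'd rather pick the location out programmatically, a minimal sketch like this should work (it assumes a spark session in the notebook; the exact column names of the DESCRIBE output vary a little between runtime versions, so it matches rows by position):
%python
# Sketch: pull the Location row out of the DESCRIBE DATABASE EXTENDED output.
# Each row is roughly (info_name, info_value); match on the first column.
rows = spark.sql("describe database extended <dbname_here>").collect()
location = next((row[1] for row in rows if row[0] == "Location"), None)
print(location)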
Unfortunately that won't help for default because it will have the location dbfs:/user/hive/warehouse.
*Note this is pre-Unity Catalog; we shouldn't be using default anymore.
It might be that you created students as an external table; if so, running a describe on the table (see the example below) may show you an S3 path. Again, if the table is managed in default it will have this location: dbfs:/user/hive/warehouse/<tablename_here>
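For reference, the table-level equivalent is standard Spark SQL (substitute the real table name):
%sql
describe table extended <tablename_here>
The Location row in that output gives the storage path - an s3:// URI for an external table, or a dbfs:/ path for a managed one.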
If you are looking for the source location backing a mount (e.g. within dbfs:/mnt), then there is a way to do that using dbutils.fs.mounts(). For example, this helper function will tell you the source of a DBFS mount:
from typing import Optional

def get_dbfs_mnt_source(mount: str) -> Optional[str]:
    """Gets the string path source for a dbfs mount point."""
    mnts = dbutils.fs.mounts()
    source = None
    for mnt in mnts:
        if mnt.mountPoint == mount:
            source = mnt.source
            break
    return source

# usage
s = get_dbfs_mnt_source('/mnt/s3-load-unload')
print(s)
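Building on that, here is a rough sketch (the helper name and the example mount/path are hypothetical) that resolves a full dbfs:/mnt/... file path to its backing cloud storage path by splicing the remainder of the path onto the mount source:
%python
from typing import Optional

def resolve_dbfs_path(path: str) -> Optional[str]:
    """Maps a dbfs:/mnt/... path to its backing storage path, or None if no mount matches."""
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]              # e.g. '/mnt/s3-load-unload/raw/2023'
    best = None
    for mnt in dbutils.fs.mounts():
        if mnt.mountPoint == "/":
            continue                            # skip the DatabricksRoot entry
        prefix = mnt.mountPoint.rstrip("/")
        if path == prefix or path.startswith(prefix + "/"):
            # keep the longest (most specific) matching mount point
            if best is None or len(prefix) > len(best.mountPoint.rstrip("/")):
                best = mnt
    if best is None:
        return None
    prefix = best.mountPoint.rstrip("/")
    return best.source.rstrip("/") + path[len(prefix):]

# usage (hypothetical mount and path)
print(resolve_dbfs_path('dbfs:/mnt/s3-load-unload/raw/2023/01/01'))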
This information isn't really available inside the execution context of the cluster. The closest I can think of is to use the Account REST API, but you need to be an account admin for that: fetch the workspace details (which include the storage_configuration_id field), and then use the Get Storage Configuration API to retrieve the bucket information.
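If you do have account-admin access, a rough sketch of that route looks like the following (AWS account endpoints assumed; the paths, auth method and field names here are from memory, so check the current Account API reference before relying on them):
%python
import requests

# Hypothetical IDs/credentials - fill in your own.
ACCOUNT_ID = "<account-id>"
WORKSPACE_ID = "<workspace-id>"
BASE = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"
auth = ("<account-admin-user>", "<password>")

# 1. Get the workspace, which includes its storage_configuration_id.
ws = requests.get(f"{BASE}/workspaces/{WORKSPACE_ID}", auth=auth).json()
storage_configuration_id = ws["storage_configuration_id"]

# 2. Get the storage configuration, which includes the root bucket name.
cfg = requests.get(f"{BASE}/storage-configurations/{storage_configuration_id}", auth=auth).json()
print(cfg["root_bucket_info"]["bucket_name"])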