
Delta Lake storage Layers - Concepts

I am new to Databricks and have the following question:

Databricks proposes three storage layers: Bronze (raw data), Silver (clean data), and Gold (aggregated data). It is clear what each layer is meant to store, but how are these layers actually created or identified? How do we specify the layer when retrieving data from Silver or Gold? Are they different databases, different formats, or something else?

Please help me understand this concept.

Mak asked Oct 18 '25 08:10



1 Answer

These are logical layers:

  • The Bronze layer stores the original data without modification. The most common change is just the data format, for example taking input data as CSV and storing it as Delta. The main goal of the Bronze layer is to keep the original data so that you can rebuild the Silver and Gold data if necessary, for example if you find errors in the code that produces the Silver layer. Whether you need a Bronze layer depends heavily on the source of the data. For example, if your data comes from a database, you can often expect it to be clean already, and in that case you can ingest it directly into the Silver layer. The Bronze layer usually isn't accessed directly by end users.
  • The Silver layer is created from Bronze by applying transformations, enrichment, and cleanup procedures. For example, if data in some column must be non-null or fall within a certain range, you can add code like bronze_df.filter("col1 is not null") and store the results. The Silver layer can be regenerated from Bronze if you find an error in your transformations or need to add an additional check. The Silver layer is usually accessible to end users who need detailed, row-level data.
  • The Gold layer usually holds aggregated data used for reporting, dashboards, etc. There can be multiple tables in the Gold layer, each generated from one or more Silver tables.
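Because the layers are only logical, how you identify them is a naming convention you choose yourself. A minimal PySpark sketch of the three layers described above, assuming one schema (database) per layer: the schema names, the "events" table, and the "col2" grouping column are hypothetical, and only the col1 filter comes from the example above.

```python
# Sketch only: the schema-per-layer convention, the "events" table, and the
# "col2" column are assumptions for illustration, not something Delta Lake
# or Databricks prescribes.
LAYERS = ("bronze", "silver", "gold")


def table_name(layer, name):
    """Return a schema-qualified table name such as 'silver.events'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{layer}.{name}"


def run_pipeline(spark, source_csv_path):
    """Rebuild Bronze, Silver, and Gold tables from one CSV source.

    `spark` is an active SparkSession (predefined in Databricks notebooks).
    Every table is overwritten, so this is a full rebuild, not an
    incremental load.
    """
    for layer in LAYERS:
        spark.sql(f"CREATE SCHEMA IF NOT EXISTS {layer}")

    # Bronze: keep the data as received; only the storage format changes.
    raw_df = spark.read.option("header", "true").csv(source_csv_path)
    raw_df.write.format("delta").mode("overwrite").saveAsTable(
        table_name("bronze", "events")
    )

    # Silver: apply the cleanup rule (col1 must be non-null).
    bronze_df = spark.table(table_name("bronze", "events"))
    silver_df = bronze_df.filter("col1 is not null")
    silver_df.write.format("delta").mode("overwrite").saveAsTable(
        table_name("silver", "events")
    )

    # Gold: aggregate the Silver rows for reporting (row count per col2).
    gold_df = spark.table(table_name("silver", "events")).groupBy("col2").count()
    gold_df.write.format("delta").mode("overwrite").saveAsTable(
        table_name("gold", "events_by_col2")
    )
```

With a convention like this, reading from a layer is just a matter of the table name, for example spark.table("silver.events"). Separate storage paths per layer would work equally well; the format (Delta) is the same in all three.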

Databricks usually recommends using Delta Lake for all of these layers because it makes it easier to process data incrementally between layers, typically with Structured Streaming. But you're not limited to that. I've seen many customers write the Gold layer's results to an Azure SQL database, a NoSQL database, or something else, so that they can be consumed by applications that work only with those systems.
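As a rough illustration of incremental processing between layers, a Bronze Delta table can be read as a stream and written to Silver. This is a sketch under assumptions: the table names "bronze.events" and "silver.events" and the checkpoint path are hypothetical, and trigger(availableNow=True) needs Spark 3.3 or later.

```python
def stream_bronze_to_silver(spark, checkpoint_path):
    """Incrementally copy cleaned rows from Bronze to Silver.

    `spark` is an active SparkSession. The checkpoint records which Bronze
    data has already been processed, so each run handles only new rows.
    Table names are hypothetical placeholders.
    """
    query = (
        spark.readStream.table("bronze.events")    # stream over the Delta table
        .filter("col1 is not null")                # same cleanup rule as a batch job
        .writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                # process the backlog, then stop
        .toTable("silver.events")
    )
    return query
```

The returned object is a streaming query handle; calling query.awaitTermination() blocks until the available data has been processed.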

Alex Ott answered Oct 22 '25 04:10



