I am new to Databricks and have the following question:
Databricks proposes 3 layers of storage: Bronze (raw data), Silver (clean data), and Gold (aggregated data). It is clear what these storage layers are meant to store, but how are they actually created or identified? How do we specify which layer to read from when retrieving data from Silver or Gold? Are these different databases, different formats, or something else?
Please help me get this concept clear.
These are logical layers.
For example, the Silver layer can be produced from Bronze by applying a transformation such as bronze_df.filter("col1 is not null") and storing the results. The Silver layer can be regenerated from Bronze if you find an error in your transformations or need to add an additional check. The Silver layer is usually accessible to end users who need detailed, row-level data.
Databricks usually recommends using Delta Lake for all these layers, because it makes it easier to process data incrementally between layers, usually with Structured Streaming. But you're not limited to that: I've seen many customers who write the results of the Gold layer to Azure SQL Database, NoSQL databases, or other systems, from which the data can be consumed by applications that work only with those systems.
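To make the "logical layer" idea concrete, here is a minimal PySpark sketch of how the layers could be kept apart simply by storing them in different locations. Everything specific in it is an assumption for illustration: the paths in LAYER_PATHS, the column col2, and the function names are hypothetical, and it assumes a Spark session with Delta Lake available (as it is on Databricks). Only the col1 filter comes from the answer above.

```python
# Hypothetical storage locations: one directory (or table) per logical layer.
LAYER_PATHS = {
    "bronze": "/tmp/lakehouse/bronze/events",
    "silver": "/tmp/lakehouse/silver/events",
    "gold": "/tmp/lakehouse/gold/events_by_col2",
}


def build_silver(spark):
    """Read Bronze, drop rows with a null col1, and overwrite Silver."""
    bronze_df = spark.read.format("delta").load(LAYER_PATHS["bronze"])
    silver_df = bronze_df.filter("col1 is not null")
    silver_df.write.format("delta").mode("overwrite").save(LAYER_PATHS["silver"])


def build_gold(spark):
    """Aggregate Silver into a small summary and overwrite Gold."""
    silver_df = spark.read.format("delta").load(LAYER_PATHS["silver"])
    gold_df = silver_df.groupBy("col2").count()  # col2 is a hypothetical column
    gold_df.write.format("delta").mode("overwrite").save(LAYER_PATHS["gold"])


def read_layer(spark, layer):
    """Picking a layer at read time is just picking which location to load."""
    return spark.read.format("delta").load(LAYER_PATHS[layer])
```

In a Databricks notebook the spark session is already defined, so you could call build_silver(spark), then build_gold(spark), and later read_layer(spark, "gold") to query the aggregated data. Because Silver is derived entirely from Bronze, rerunning build_silver after fixing the filter regenerates it. Separate databases or schemas per layer would work the same way; the paths here are just one possible convention.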