I understand there are good reasons for using surrogate keys in data warehouse dimensions.
Still, I do not understand how I can link them to my fact table's foreign keys.
In the fact table I have only natural keys, extracted during ETL. Surrogate keys are not present in the original database tables.
Any suggestions on this? Thank you
There are several "see" references below. This is my first answer on Stack Overflow, so I don't have enough reputation points to provide the links to you yet. If you lookup those terms on Wikipedia, they provide a more eloquent description of those things than I could.
In data warehouses I have worked with we typically store the surrogate keys that reference the various dimensions in the fact table. In fact, I avoid storing the natural keys from the source systems in the fact table except in special circumstances (e.g., degenerate dimensions). There are a couple of reasons for this:
- Join efficiency - some source systems might not use a simple integer key; using a surrogate key allows you to reduce this complexity so that your data warehouse query performance is better since it only has to deal with a single column integer join.
- Abstraction of your fact table from the source system; your fact table might outlive a particular source system (or version of a source system) or the facts may be coming from different source systems with different natural keys. Regardless of natural key design, the relationship between the fact table and the dimension table remains the same.
- Accurate and efficient point-in-time facts - If history of attributes in your dimensions are important, you can use the surrogate key to allow you to store historical copies of your dimension rows and attach the correction version to your fact table row (see Slowly Changing Dimension; especially Type 2)
- The dimension may be used by multiple fact tables from multiple source systems or my be consolidated from multiple source systems in which case there will not be a simple relationship between source system natural key and the dimension surrogate key (see Conformed Dimensions)
- Unknowns - Sometimes you can have facts where the natural key is NULL, an invalid value, or some weirdness. You can represent that condition and still maintain referential integrity in the database by having one of more special rows in your dimension table to represent Unknown, Invalid, Hasn't Happened Yet, or whatever is appropriate. (Technically, a value of NULL can't be a key, but some database engines will let you get away with it at the cost of performance and usability of your data warehouse)
- I'm sure I'm forgetting some really important one...
Generally, during the transform phase of loading the fact table, I lookup the surrogate key for the natural key coming from the source system and then store the surrogate key in the fact table instead of the natural key. I don't know what platform you're on, you could use JOINs on most any database platform to do this. I use SSIS lookups frequently on the Microsoft SQL Server platform.