Best approach to implement inheritance in a data warehouse based on a postgres database

Question

I am developing a multi-step data pipeline that should optimize the following process:

1) Extract data from a NoSQL database (MongoDB).

2) Transform and load the data into a relational (PostgreSQL) database.

3) Build a data warehouse using the Postgres database

I have manually coded a script to handle steps 1) and 2), which is an intermediate ETL pipeline. Now my goal is to build the data warehouse using the Postgres database, but I came across with a few doubts regarding the DW design. Below is the dimensional model for the relational database:

enter image description here

There are 2 main tables, Occurrence and Canonical, from which inherit a set of others (drawn in red and blue, respectively). Note that there are 2 child data types, ObserverNodeOccurrence and CanonicalObserverNode, that have an extra many-to-many relationship with another table.

I made some research regarding how inheritance should be implemented in a data warehouse and figured the best practice would be to merge together the family data types (super and child tables) into a single table. Doing this would imply adding extra attributes and a lot of null values. My new dimensional model would look like the following:

enter image description here

Question 1: Do you think this is the best approach to address this problem? If not, what would be?

Question 2: Any software recommendations for on-premise data warehouses? (on-premise is a must since it contains sensitive data)

Laurenz Albe · Accepted Answer

Usually having fewer tables to join and denormalizing data will improve query performance for data warehouse queries, so they are often considered a good thing.

This would suggest your second table design. NULL values don't occupy any space in a PostgreSQL table, so you need not worry about that.

Marmite Bomber · Answer

As described here there are three options to implement inheritance in a relational database.

IMO the only practicable way to be used in data warehouse is the Table-Per-Hierarchy option, which merges all entities in one table.

The reason is not only the performance gain by saving the joins. In data warehouse often the historical view of the data is important. Think, how would you model a change in a subtype in some entity?

An important thing is to define a discriminator column which uniquely defines the source entity.

Best approach to implement inheritance in a data warehouse based on a postgres database

Tags:

postgresql

database-design

class-table-inheritance

data-warehouse

multi-table-inheritance

GRoutar

2 Answers

Laurenz Albe

Marmite Bomber

Recent Activity

Donate For Us

Best approach to implement inheritance in a data warehouse based on a postgres database

Tags:

postgresql

database-design

class-table-inheritance

data-warehouse

multi-table-inheritance

GRoutar

2 Answers

Laurenz Albe

Marmite Bomber

Related questions

Recent Activity

Donate For Us