
=== Uniqueness of keys ===

[[Unique key]]s play an important part in all relational databases, as they link the tables together. A unique key is a column that identifies a given entity, whereas a [[foreign key]] is a column in another table that refers to a primary key. Keys can comprise several columns, in which case they are composite keys. In many cases, the primary key is an auto-generated integer that has no meaning for the [[Business entity (computer science)|business entity]] being represented and exists solely for the purpose of the relational database; such a key is commonly referred to as a [[surrogate key]].

As there is usually more than one data source being loaded into the warehouse, keys are an important concern. For example, customers might be represented in several data sources, with their [[Social Security number]] as the primary key in one source, their phone number in another, and a surrogate key in a third. Yet a data warehouse may require the consolidation of all the customer information into one [[Dimension (data warehouse)|dimension]].

A recommended way to deal with the concern involves adding a warehouse surrogate key, which is used as a foreign key from the fact table.<ref>Kimball, The Data Warehouse Lifecycle Toolkit, p. 332</ref>

A dimension's source data is usually updated over time, and these updates must be reflected in the data warehouse.

If the primary key of the source data is required for reporting, the dimension already contains that piece of information for each row. If the source data uses a surrogate key, the warehouse must keep track of it even though it is never used in queries or reports; this is done by creating a [[lookup table]] that contains the warehouse surrogate key and the originating key.<ref name="Rizzi, Data Warehouse Design p. 291">Golfarelli/Rizzi, Data Warehouse Design, p. 291</ref> This way, the dimension is not polluted with surrogates from various source systems, while the ability to update is preserved.
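
The lookup table described above can be illustrated with a minimal Python sketch. It is not taken from the cited sources: the class name <code>KeyLookup</code>, the method <code>resolve</code>, the source system names and the sample customer are all hypothetical, and the sketch assumes that a source key is unique only within its own source system. It does not attempt to match the same customer across different sources.

```python
from itertools import count


class KeyLookup:
    """Hypothetical lookup table: (source system, source key) -> warehouse key."""

    def __init__(self):
        self._next_key = count(1)  # assumed generator of warehouse surrogate keys
        self._by_source = {}       # (source_system, source_key) -> warehouse key

    def resolve(self, source_system, source_key):
        """Return the existing warehouse key for the pair, or assign a new one."""
        pair = (source_system, source_key)
        if pair not in self._by_source:
            self._by_source[pair] = next(self._next_key)
        return self._by_source[pair]


lookup = KeyLookup()
dimension = {}  # warehouse key -> attributes; holds no source surrogates

# First load: the source surrogate 9001 is recorded only in the lookup table.
wk = lookup.resolve("crm", 9001)
dimension[wk] = {"name": "Ada Example", "city": "Oslo"}

# A later extract with the same source key resolves to the same dimension row,
# so the row can still be updated.
same_wk = lookup.resolve("crm", 9001)
dimension[same_wk]["city"] = "Bergen"

# The same integer coming from another source system is treated as a new entity.
other_wk = lookup.resolve("billing", 9001)
```

Keeping the source surrogate in a separate mapping, rather than as a column of the dimension, is what allows the dimension to stay free of source-specific keys while updates can still be routed to the right row.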

The lookup table is used in different ways depending on the nature of the source data.
There are five types to consider;<ref name="Rizzi, Data Warehouse Design p. 291"/> three are included here:
;Type 1
:The dimension row is simply updated to match the current state of the source system; the warehouse does not capture history; the lookup table is used to identify the dimension row to update or overwrite.
;Type 2
:A new dimension row is added with the new state of the source system; a new surrogate key is assigned; the source key is no longer unique in the lookup table.
;Fully logged
:A new dimension row is added with the new state of the source system, while the previous dimension row is updated to reflect that it is no longer active and to record the time of deactivation.
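
The three types listed above can be contrasted in a short Python sketch. This is an illustration under stated assumptions rather than a reference implementation: the function names, the <code>active</code> and <code>deactivated_at</code> fields, the in-memory dictionaries, and the key generator <code>_new_key</code> are hypothetical, and the use of a list of pairs as lookup table for the fully logged case is an assumption.

```python
from datetime import datetime, timezone


def _new_key(dimension):
    """Hypothetical surrogate key generator: next unused integer."""
    return max(dimension, default=0) + 1


def apply_type1(dimension, lookup, source_key, attrs):
    """Type 1: overwrite the row found through the lookup table; no history."""
    wk = lookup.get(source_key)
    if wk is None:  # first time this source key is seen
        wk = _new_key(dimension)
        lookup[source_key] = wk
    dimension[wk] = dict(attrs)
    return wk


def apply_type2(dimension, lookup_rows, source_key, attrs):
    """Type 2: add a new row under a new surrogate key.

    lookup_rows is a list of (source_key, warehouse_key) pairs, because the
    source key is no longer unique in the lookup table.
    """
    wk = _new_key(dimension)
    dimension[wk] = dict(attrs)
    lookup_rows.append((source_key, wk))
    return wk


def apply_fully_logged(dimension, lookup_rows, source_key, attrs, now=None):
    """Fully logged: add a new row and mark the previous one as inactive."""
    now = now or datetime.now(timezone.utc)
    for key, wk in lookup_rows:
        if key == source_key and dimension[wk].get("active"):
            dimension[wk]["active"] = False
            dimension[wk]["deactivated_at"] = now  # time of deactivation
    wk = _new_key(dimension)
    dimension[wk] = {**attrs, "active": True, "deactivated_at": None}
    lookup_rows.append((source_key, wk))
    return wk
```

Under these assumptions, calling <code>apply_type1</code> twice for the same source key leaves a single dimension row holding the latest state, whereas <code>apply_type2</code> and <code>apply_fully_logged</code> leave two rows, the latter with the older row flagged as inactive.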