In which the reader, having been intellectually prepared with a review of set theory, gets their first glimpse at relational database design.
To qualify as a table in a relational database, Chris Date states that it must have a primary key. I agree. While a DBMS like Sql Server will allow you to insert a set of rows into a ‘table’, its not really a relational table. Not everybody is hip to this concept, and I have seen vendor developed systems with hundreds of tables that have no primary key defined. These sorts of systems are not able to take advantage of the benefits relational theory provides, but for many entrepeneurs, the easiest way to make money selling software is to pay as little as possible to developers.
A ‘relational’ table must contain the ability to uniquely identifiy a single row in the table. A column or combination of columns must be unique for each row. Occasionally a table may contain more than one set of columns that can uniquely specify a row. These are called candidate keys, and one must be selected to be a primary key.
Ensuring that each table has a primary key is what I refer to as “Primary key discipline”. As far as what comprises the primary key, there are generally two types of keys—natural and artificial. Which type works best for you depends on the individual situation.
A natural key is one that is composed of column(s) that exist in the data in the row. For instance, in one of our favorite examples consider the invoice that has multiple product sales on it. An invoice table will have the invoice number, and using that for the primary key produces a natural type primary key. I have encountered situations where all but one of the columns makes a row unique, and that last column might contain a quantity or other attribute. Wide natural keys may have performance implications when queries are executed, but its extremely important to understand the logical aspect of this first, and once that is all mastered, one may lift up the hood and look at query performance.
When a primary key has more than one column, its referred to as a composite key.
An artificial key is composed of a column or columns added to the existing columns ‘data’ of a table. The most common implementation of this is an incrementing integer assigned at the time a row is inserted. In SQL Server, this is provided by an “identity” type column that takes care of the incrementing value for you. Another value for an artificial key may be a GUID, a Globally Unique Identifier. These are many bytes wide, and in a circumstance where there are millions of rows, the smaller size of an integer might be more performant. I believe the nature of the artificial key is such that designing a scheme where a table has more than one artificial key seems to defeat the advantage of a single small key that an artificial key can provide.
While I used to much prefer a natural key when one or two columns would do the trick, artifical keys can make certain things simpler. For instance, when a key is 7 columns wide using it in queries can be cumbersome. For instance the code to determine when one composite key is greater than another, and the values have a hierarchy, such as year, month, day, product sequence number require a lot of code to determine which is greater. Dates are normally treated as a single column to simplify this type of thing, but the example helps us think of a composite key with a hierarachy of values. For instance, if the year is greater, the values for month are not important. If the year is the same, the month needs to be considered, and so on down the hierarchy.
An artificial key allows attributes to be changed without the key changing. For instance, I believe that in general Azure cloud objects are guid-indexed. Because of this, I can rename an entra group without it being a breaking change. All references to the group, such as the list of groups an employee belongs to, automatically display the new name. Were they key the group name, changing it would be a breaking change to other components of the database that refer to the group by name.
Another concept, perhaps advanced, and certainly nuanced, is whether table is “ledgered” This word comes from the olden days when companies used paper accounting ledgers. If there was an error in an accounting entry, or an adjustment to it, a correcting entry is added to the ledger rather than changing the original entry. A good definition of a ledgered table is one where inserted rows are never changed. Knowing this definition created an awareness of the concept and allows precise and compact communication. I might ask a developer, “Is the table ledgered?” When a table is ledgered, one can create an incremental copy of the data by simply keeping track of the previous “high water mark”—the highest key value that was copied in the previous extract process. Datetime may also work as a high water mark, depending on the nature of the data.
I once worked with some financial analysts who worked with data we copied from a large vendor database (no primary keys there) into their own sql server where they queried it. This was financial data—an accounting ledger in fact. To copy the entirety of the vendor table into the analysts server took the better part of a day. So, we set up a scheme of incremental loads. Well, turns out the vendor mostly did not go back and change entries, but on occasion they did. So it was sort of mostly ledgered. We found the incrementally loaded data was good enough so we did daily incrementals and every few weeks we would set aside some time and run the full load which brought all the changed entries up to date.
There is a lot of wisdom here, and it probably warrants multiple readings for complete understanding. The main thing to know on first brush is that primary key discipline is essential to rigorous and systematic relational database design.