In which the author discloses a pet peeve and lays out the minimum necessary understanding of set theory for query writing and relational database design.
While the concepts had antecedents in the form of thoughts and papers written by mathematical thinkers, the generally accepted origin of relational database theory is a paper written by a mathematician or engineer at IBM, E. F. Codd, and published in the Communications of the Association of Computing Machinery in June of 1970. Entitled “A Relational Model of Data for Large Shared Data Banks”. codd.pdf
The paper is written in mathematical language and not very accessible to a non-mathematician. Fortunately the concepts are laid out in simpler fashion in subsequent publications such as the one by Chris Date that I previously mentioned, and I imagine in hundreds of computer science textbooks by now.
In the olden days, I attended a relational database session in which LargeFinCo brought in an expert to give the development staff one day’s worth of relational database theory. He started out by explaining that it was called ‘relational database theory’ because the tables had relationships between them. I prepared to object, but another programmer beat me to it, and explained that is not the reason its called relational, and in fact the relation is a mathematical analogue to a table. Its a minor point, really a point of style, but perhaps something to whip out at some point during a heated debate over a relational database design, or to embarrass an instructor with only a superficial understanding of his course material—which we ought not do as that would be mean.
Codd was a mathematician, and a “relation” is a mathematical object with many attributes and applications. For the purposes of our discussion, a relation can be thought of as a set of name-value pairs which can form the basis of something that can be represented as a table. A precise mathematical description of a database table can be expressed as “a set of n-tuples”, but we will get to that later.
The foundational building block of relational database theory is the mathematical concept of the set. While I proudly display the book Naive Set Theory by Halmos in my library, one need not go to such extremes. I do commend you to some fuller explanations of set theory available on the interweb and found by your favorite search engine. There are a few things you must memorize and understand at an intuitive level to work with relational databases—I believe that a full enough explanation would run less than 10 pages. Most of these things people learn by working with this stuff, but I believe a more formal grounding brings out some nuance that one might not pick up on the job.
Here are the concepts that I believe are fundamental to relational database theory, and extremely general in application to many aspects of computer technology:
Definition of a set: Mathematically a set may be any arbitrary collection of designated objects. For instance the set of integers greater than 3, or the letters in the english alphabet. When I talk about sets here, I will always be referring to sets of identical mathematical objects. This will be explained as we go through it.
Union of sets: The set of elements contained in two sets with duplicates eliminated. For instance the union of {a,b,c,d} and {d,e,f} will be {a,b,c,d,e,f}. The letter d is in both sets but only included once in the resulting set. Note that set operations are “closed on sets’, that is sets are input to the operation and the output is a set.
Intersection of sets: The set of elements both sets have in common. For instance the union of {a,b,c,d} and {d,e,f} will be {d}.
Subtraction of sets. The removal from one set the elements of the set that exist in another set. For instance subtraction of {d,e,f} from {a,b,c,d} will be {a,b,c}. This can also be expressed as identifying the intersection of two sets and removing the intersection from the target set.
Multiplication of sets: “Cartesian product”. Create a set containing all possible combination of the elements in two sets. For example, the multiplication of {1,2,3} and {a,b} results in {1a, 1b, 2a, 2b,3a,3b}. Multiplication of large sets yields very large results.
Partition of a set: Identifying subsets of a set such that a union of the subsets yields the entire set, and an intersection of the subsets yields an empty set. One common partitioning scheme is the partitioning of the continental United States into individual states and the District of Columbia which is a federal animal and not a state, strictly speaking.
These concepts and set operations are ubiquitous in SQL query writing and relational database design.