Musings on Database Normalization
Much of the work I do has to do with formatting the presentation layer of data. The typical data flow is (a) vendor sends data, (b) we transform and store it in a format somewhere between what they send us and what is required by the display layer (website), and (c) procedures are written against this data.
Having the luxury of placing the burden of data integrity on the vendor, combined with an asynchronous replication model, means that most of the time we cannot and do not use surrogate keys or foreign keys.
Some of the time this works fine; actually, most of the time it works fine. Of course, if the data consumption architecture deviates from the simple truncate/insert, correcting data can be vexing.
I had my first taste of self-inflicted bad database design on a recent project. Hindsight being what it is, I would have done many things differently (not the least of which would have been more planning before developing), but as it is, it was a good learning experience.
I was given a data set which I didn't understand well and a limited amount of time to develop against it. The data sets were effectively fact tables from a data warehouse, some with provided dimension files and some without. (By the way, for reasons I won't get into, SSAS was not an option.) The data sets were also quite large (fact tables around 100 million records with all the different dimensions factored in). To add insult to injury, the default format the data came in was name-value pairs rather than columns (i.e. {Name: Price, Value: 3.14} as opposed to {Price: 3.14}).
Without surrogate keys, a primary key might easily span eight columns, including a generic "Name" column, catapulting the number of records by a factor of however many members there were in the table.
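To give a flavor of the problem, here's a simplified sketch of such a table; the table and column names are made up, not the real schema:

```sql
-- Hypothetical NVP fact table: every measure becomes its own row,
-- and the generic Name column is part of the composite primary key.
CREATE TABLE dbo.VendorFactNVP
(
    VendorId     int           NOT NULL,
    RegionId     int           NOT NULL,
    ProductCode  varchar(20)   NOT NULL,
    AsOfDate     date          NOT NULL,
    ScenarioId   int           NOT NULL,
    CurrencyCode char(3)       NOT NULL,
    SourceFileId int           NOT NULL,
    Name         varchar(50)   NOT NULL,  -- e.g. 'Price', 'Quantity', ...
    Value        decimal(18,6) NULL,
    CONSTRAINT PK_VendorFactNVP PRIMARY KEY
        (VendorId, RegionId, ProductCode, AsOfDate,
         ScenarioId, CurrencyCode, SourceFileId, Name)
);
```

With ten measures per logical fact, the same seven-column context gets stored ten times over, once per Name.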
There were quite a few revisions I undertook to optimize both retrieval and storage. The first was to do away with the NVP structure and use dedicated columns. Sure, new values required a little bit of dev work to hook up, but even if it were a schema-less NVP format, some amount of work would still need to be done to hook up the new data points. So that cut table sizes dramatically. In combination with data compression, the data set was now looking quite manageable.
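The conversion itself amounts to a cross-tab. Something along these lines, continuing the hypothetical schema above and showing only two measures, is a sketch of the idea:

```sql
-- Collapse the NVP rows into one row per fact with dedicated columns
-- (dbo.VendorFact is the assumed columnar target table).
INSERT INTO dbo.VendorFact
    (VendorId, RegionId, ProductCode, AsOfDate,
     ScenarioId, CurrencyCode, SourceFileId, Price, Quantity)
SELECT
    VendorId, RegionId, ProductCode, AsOfDate,
    ScenarioId, CurrencyCode, SourceFileId,
    MAX(CASE WHEN Name = 'Price'    THEN Value END) AS Price,
    MAX(CASE WHEN Name = 'Quantity' THEN Value END) AS Quantity
FROM dbo.VendorFactNVP
GROUP BY VendorId, RegionId, ProductCode, AsOfDate,
         ScenarioId, CurrencyCode, SourceFileId;

-- Page compression (SQL Server 2008+) shrinks the result further.
ALTER TABLE dbo.VendorFact REBUILD WITH (DATA_COMPRESSION = PAGE);
```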
But that's not the main point I want to make (although it is a good example of why NVP models suck).
The next step I took (and what I should have done from the beginning) was to really get to know the data and understand the natural hierarchies and unique entities. Since not all of this data was provided as reference data by the vendor, I built a process to build inferred dimensions from the incoming files and maintain surrogate identifiers as necessary.
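The inferred-dimension step boils down to "insert any member we haven't seen yet." A minimal sketch, assuming a staging table for the incoming file and a dimension with an identity surrogate key (all names hypothetical):

```sql
-- Dimension with a surrogate key assigned by the database.
CREATE TABLE dbo.DimProduct
(
    ProductKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ProductCode varchar(20)       NOT NULL UNIQUE
);

-- Any code in the incoming file that has no dimension row yet
-- becomes an inferred member with a fresh surrogate key.
INSERT INTO dbo.DimProduct (ProductCode)
SELECT DISTINCT s.ProductCode
FROM staging.VendorFact AS s
WHERE NOT EXISTS
    (SELECT 1 FROM dbo.DimProduct AS d
     WHERE d.ProductCode = s.ProductCode);
```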
What I ended up with was two conceptual layers in the database: a decomposed and normalized version of the fact tables the vendor sent, and a set of custom pre-aggregated tables to surface data quickly for specific uses (where indexing alone was insufficient). The main advantage of the normalized structure was the amount of work it removed from building and maintaining these "second-level" aggregate data structures. Being able to reference master records of data ahead of time, and logically knowing where to look for everything, made the process of pre-populating tables significantly easier. Plus it put much more control over the integrity of the data into my hands.
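As an illustration, rebuilding one of those second-level tables from the normalized layer is a single pass, since the dimension keys are already resolved (continuing the same hypothetical schema):

```sql
-- Rebuild a pre-aggregated table from the normalized fact layer.
TRUNCATE TABLE dbo.AggDailyProductTotals;

INSERT INTO dbo.AggDailyProductTotals
    (ProductKey, AsOfDate, TotalQuantity, AvgPrice)
SELECT
    d.ProductKey,
    f.AsOfDate,
    SUM(f.Quantity),
    AVG(f.Price)
FROM dbo.VendorFact AS f
JOIN dbo.DimProduct AS d
    ON d.ProductCode = f.ProductCode
GROUP BY d.ProductKey, f.AsOfDate;
```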
The next project I worked on was a departure from the model I described in the first paragraph (vendor -> store -> present). Instead, I built a database drawing upon both internal and external data sets. I used many lessons from the project I described earlier. First and foremost, I took (had?) the time to plan out the database ahead of time. Wherever possible, I normalized things out, in a few cases even going so far as to have reference tables with one or two id/value rows in them. I created surrogate identifiers where disparate data sources needed to be standardized, I documented the database, and I made use of foreign keys for reasonably static reference data.
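Even a tiny reference table earns its keep once a foreign key points at it. A sketch of one of those one-or-two-row tables, with hypothetical names and assuming an existing dbo.Entity table carrying a DataSourceId column:

```sql
-- Two rows, but the foreign key guarantees nothing else ever appears.
CREATE TABLE dbo.RefDataSource
(
    DataSourceId tinyint     NOT NULL PRIMARY KEY,
    Name         varchar(20) NOT NULL
);

INSERT INTO dbo.RefDataSource (DataSourceId, Name)
VALUES (1, 'Internal'), (2, 'External');

ALTER TABLE dbo.Entity WITH CHECK
    ADD CONSTRAINT FK_Entity_RefDataSource
    FOREIGN KEY (DataSourceId) REFERENCES dbo.RefDataSource (DataSourceId);
```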
First, the bad. Although surrogate keys are, to an extent, supposed to be business-process agnostic, there was a need to be able to uniquely identify entities which had no natural key. As such, when setting up new items, while the value of the surrogate key was irrelevant, those identifiers made their way into descriptive models and processes. There was distinct logic written around EntityID:1 vs. EntityID:2. This means that when moving data from environment to environment, updating values, and correcting data, these identifiers have to be maintained. MyFirstEntity must always be ID:1, and MySecondEntity must always be ID:2, not just randomly assigned numbers. Perhaps there's a better way to uniquely identify elements with no natural key, but short of fuzzy logic, I'm not aware of one.
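Keeping those identifiers pinned across environments meant inserting them explicitly rather than letting the identity column assign them. A sketch of the pattern, against a hypothetical dbo.Entity table with an identity key:

```sql
-- Seed well-known entities with fixed surrogate values so that
-- EntityID:1 means the same thing in every environment.
SET IDENTITY_INSERT dbo.Entity ON;

INSERT INTO dbo.Entity (EntityId, Name)
SELECT v.EntityId, v.Name
FROM (VALUES
        (1, 'MyFirstEntity'),
        (2, 'MySecondEntity')) AS v (EntityId, Name)
WHERE NOT EXISTS
    (SELECT 1 FROM dbo.Entity AS e WHERE e.EntityId = v.EntityId);

SET IDENTITY_INSERT dbo.Entity OFF;
```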
Additionally, foreign keys, while central to database integrity, tie your hands a bit in terms of how you can modify an immature database. For instance, during early development cycles, I was changing table structures every few days or so. Foreign keys did their job admirably and prevented me from taking shortcuts to restructure the database and data; objectively a good thing, but it takes some getting used to.
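The workaround is mechanical but deliberate: drop the constraint, restructure, then re-add it so the engine revalidates the existing rows. Roughly, with hypothetical table and constraint names:

```sql
-- Temporarily release the constraint to allow the restructuring.
ALTER TABLE dbo.ChildTable DROP CONSTRAINT FK_ChildTable_ParentTable;

-- ...reshape dbo.ParentTable and migrate the data here...

-- Re-add WITH CHECK so every existing row is revalidated,
-- rather than leaving the constraint untrusted.
ALTER TABLE dbo.ChildTable WITH CHECK
    ADD CONSTRAINT FK_ChildTable_ParentTable
    FOREIGN KEY (ParentId) REFERENCES dbo.ParentTable (ParentId);
```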
Now, the good. As with the first project I mentioned, I had to build several consolidated tables in order for front-end queries to perform optimally, but having everything structured logically made building the new structures embarrassingly simple once the foundation was configured. Highly (maybe 3NF) normalized databases requiring a wide variety of complex query formats do present problems for systems with low latency tolerance, not all of which can be ameliorated through indexing and query design alone. But again, building de-normalized structures off of the normalized data, over which I had full control, was simple compared to trying to maintain 20 de-normalized tables and having to make sure the name of an entity never got out of sync between them.
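That sync problem is exactly where the normalized master pays off: a rename happens once, and pushing it into any denormalized structure is a single join. A sketch, names hypothetical:

```sql
-- Refresh a denormalized reporting table from the normalized master.
UPDATE r
SET r.EntityName = e.Name
FROM dbo.ReportWideTable AS r
JOIN dbo.Entity AS e
    ON e.EntityId = r.EntityId
WHERE r.EntityName <> e.Name;
```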
I've worked with several highly normalized systems in the past, and while the learning curve to understand the data can be steeper than if you were looking at a de-normalized table, report, or spreadsheet, once you understand the data, it makes getting what you need a much more sane process.
Foreign keys were the unsung heroes of this database. They're not as obviously helpful as normalization or other structural advantages, but they prevent things from going wrong. Every day that I don't have orphaned records, they're doing their job. Having the assurance that the schema I've spent so long tinkering with and refining will work correctly is well worth the effort. Sure, it takes time to ensure I delete from tables in the correct order and don't add sub-entities before their parent, but I doubt anyone would argue that avoiding the cost of learning to do things in the right order is worth having an unstable database.
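In practice, "the correct order" just means children before parents, ideally inside one transaction. A sketch against the hypothetical entity tables from earlier:

```sql
-- Remove an entity and its dependents without orphaning anything.
DECLARE @EntityId int = 1;  -- hypothetical target

BEGIN TRANSACTION;
    DELETE FROM dbo.EntityAttribute WHERE EntityId = @EntityId;
    DELETE FROM dbo.EntityFact      WHERE EntityId = @EntityId;
    DELETE FROM dbo.Entity          WHERE EntityId = @EntityId;
COMMIT TRANSACTION;
```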
There were certain instances where I made deliberate decisions to avoid foreign keys (such as on tables with very heavy traffic, and/or when an orphaned record would have no impact on the system), and I stand by those decisions.
I'm guessing normalization fans have been thinking "we told you so" since the start of the article. These are things I had no idea about when I started working with SQL. I was educated in a certain way, and the methods used by my mentors just became ingrained. I really had no significant exposure to highly normalized databases or foreign keys when I started, and nobody took the time to explain their implications (positive and negative); I wish someone had. There are still many cases where I forgo these database practices and probably will again in the future, and maybe some of the time they'll come back to haunt me. But if I can give anyone who doesn't use these tools some advice, it would be to strongly consider them. These were tools I dismissed because others dismissed them, and I assumed the problems that arose from them were just the way things were. I found excuses not to break form, but fortunately, over time, my curiosity won out, and I'm glad it did. As Robert Frost said, "… And I, I took the [road] less traveled by, and that has made all the difference."